{ "metadata": { "kernelspec": { "language": "python", "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python", "version": "3.7.12", "mimetype": "text/x-python", "codemirror_mode": { "name": "ipython", "version": 3 }, "pygments_lexer": "ipython3", "nbconvert_exporter": "python", "file_extension": ".py" }, "colab": { "name": "Lab09_Relational_Database_and_data_wrangling.ipynb", "provenance": [], "toc_visible": true, "collapsed_sections": [] } }, "nbformat_minor": 0, "nbformat": 4, "cells": [ { "cell_type": "markdown", "source": [ "**Lab 9 – Relational Database and data wrangling**" ], "metadata": { "id": "1QiCFLer1FIe" } }, { "cell_type": "markdown", "source": [ "_This notebook contains samples from https://www.kaggle.com/learn/, https://github.com/ageron/handson-ml2 and https://github.com/wesm/pydata-book._" ], "metadata": { "id": "vCyq3-8y1FIj" } }, { "cell_type": "code", "source": [ "from google.cloud import bigquery\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib as mpl\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline" ], "metadata": { "id": "5bV_HvPiH-9i", "execution": { "iopub.status.busy": "2022-04-23T03:33:52.948105Z", "iopub.execute_input": "2022-04-23T03:33:52.948724Z", "iopub.status.idle": "2022-04-23T03:33:52.976140Z", "shell.execute_reply.started": "2022-04-23T03:33:52.948606Z", "shell.execute_reply": "2022-04-23T03:33:52.975047Z" }, "trusted": true }, "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Querying data with BigQuery" ], "metadata": { "id": "6sb3RBiSo4Vf" } }, { "cell_type": "markdown", "source": [ "Structured Query Language, or SQL, is the programming language used with databases, and it is an important skill for any data scientist. In this example, you'll build your SQL skills using BigQuery, a web service that works as a database management system and lets you apply SQL to huge datasets." ], "metadata": { "id": "qvr5MUX8pFwO" } }, { "cell_type": "markdown", "source": [ "### Preliminaries for Google Colab (optional)\n", "\n", "We want to start exploring the Google BigQuery [public datasets](https://cloud.google.com/bigquery/public-data/). Let's start by walking through the required setup steps, and then we can load and explore some data.\n", "\n", "If you are using Colab, follow [this quickstart guide](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries), which will explain how to:\n", "1. Create a [Cloud Platform project](https://console.cloud.google.com/cloud-resource-manager) if you don't have one already.\n", "2. [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project\n", "3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery)\n", "4. 
[Enable the service account](https://cloud.google.com/docs/authentication/getting-started)\n", "\n", "Now we need to authenticate to gain access to the BigQuery API. We will load credentials from the service account key file (replace 'utopian-datum-340514-9ffc23108bf4.json' with the path to your own key file) and use them later to create a client." ], "metadata": { "id": "OvNovWe2Hobw" } }, { "cell_type": "code", "source": [ "from google.oauth2 import service_account\n", "\n", "# TODO(developer): Set key_path to the path to the service account key\n", "# file.\n", "\n", "key_path = \"utopian-datum-340514-9ffc23108bf4.json\"\n", "\n", "credentials = service_account.Credentials.from_service_account_file(\n", "    key_path, scopes=[\"https://www.googleapis.com/auth/cloud-platform\"],\n", ")" ], "metadata": { "id": "CLZXPyBqcysR" }, "execution_count": 5, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now that we have credentials, we load the BigQuery extension and the `google.colab.data_table` extension, which displays large pandas DataFrames as interactive tables. Loading `data_table` is optional, but it is useful when working with data in pandas." ], "metadata": { "id": "CFpSwmOJIDS_" } }, { "cell_type": "code", "source": [ "%load_ext google.cloud.bigquery\n", "%load_ext google.colab.data_table" ], "metadata": { "id": "BnVn82RDGyQu" }, "execution_count": 6, "outputs": [] }, { "cell_type": "code", "source": [ "client = bigquery.Client(credentials=credentials, project=credentials.project_id)" ], "metadata": { "id": "vcmdamwKG3Et" }, "execution_count": 23, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Create the reference\n" ], "metadata": { "id": "1UK6R1kBIaPC" } }, { "cell_type": "markdown", "source": [ "You can also work on Kaggle, which provides BigQuery integration, so you do not need to set up a Google Cloud account. **Each Kaggle user can scan 5TB every 30 days for free. 
Once you hit that limit, you'll have to wait for it to reset.** See https://www.kaggle.com/product-feedback/48573 for more details.\n" ], "metadata": { "id": "ssJU-WK9MSSE" } }, { "cell_type": "markdown", "source": [ "The first step in the workflow is to create a [`Client`](https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client) object. As you'll soon see, this `Client` object will play a central role in retrieving information from BigQuery datasets." ], "metadata": { "id": "YRCSulcqCL_o" } }, { "cell_type": "code", "source": [ "# Create a \"Client\" object if you are using Kaggle\n", "client = bigquery.Client()" ], "metadata": { "id": "NLmMfMf4CO7t", "execution": { "iopub.status.busy": "2022-04-23T03:34:20.953692Z", "iopub.execute_input": "2022-04-23T03:34:20.953969Z", "iopub.status.idle": "2022-04-23T03:34:20.959198Z", "shell.execute_reply.started": "2022-04-23T03:34:20.953941Z", "shell.execute_reply": "2022-04-23T03:34:20.958252Z" }, "trusted": true }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We'll work with a dataset of posts on Hacker News, a website focusing on computer science and cybersecurity news. In BigQuery, each dataset is contained in a corresponding project. In this case, our `hacker_news` dataset is contained in the `bigquery-public-data` project. \n", "\n", "To access the dataset, we begin by constructing a reference to the dataset with the [`dataset()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html?highlight=dataset#google.cloud.bigquery.client.Client.dataset) method. 
Next, we use the [`get_dataset()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html?highlight=get_dataset#google.cloud.bigquery.client.Client.get_dataset) method, along with the reference we just constructed, to fetch the dataset.\n", "\n", "[See the full list of public datasets](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset) or the [Kaggle BigQuery datasets](https://www.kaggle.com/datasets?search=bigquery) if you want to explore others." ], "metadata": { "id": "OgWXWB1ICXLi" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"hacker_news\" dataset\n", "dataset_ref = client.dataset(\"hacker_news\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)" ], "metadata": { "id": "oC7ldHp4CRf-", "execution": { "iopub.status.busy": "2022-04-23T03:34:22.839652Z", "iopub.execute_input": "2022-04-23T03:34:22.840484Z", "iopub.status.idle": "2022-04-23T03:34:23.269423Z", "shell.execute_reply.started": "2022-04-23T03:34:22.840431Z", "shell.execute_reply": "2022-04-23T03:34:23.268733Z" }, "trusted": true }, "execution_count": 7, "outputs": [] }, { "cell_type": "markdown", "source": [ "Every dataset is just a collection of tables. You can think of a dataset as a spreadsheet file containing multiple tables, all composed of rows and columns. We use the [`list_tables()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html?highlight=list_tables#google.cloud.bigquery.client.Client.list_tables) method to list the tables in the dataset." 
], "metadata": { "id": "M8Xms4W_I1QQ" } }, { "cell_type": "code", "source": [ "# List all the tables in the \"hacker_news\" dataset\n", "tables = list(client.list_tables(dataset))\n", "\n", "# Print names of all tables in the dataset (there are four!)\n", "for table in tables:\n", "    print(table.table_id)" ], "metadata": { "id": "M2gOPT8jCvCL", "execution": { "iopub.status.busy": "2022-04-23T03:34:25.195022Z", "iopub.execute_input": "2022-04-23T03:34:25.195282Z", "iopub.status.idle": "2022-04-23T03:34:25.507541Z", "shell.execute_reply.started": "2022-04-23T03:34:25.195254Z", "shell.execute_reply": "2022-04-23T03:34:25.506966Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "a4400d27-47aa-4080-be31-6e174551b649" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "comments\n", "full\n", "full_201510\n", "stories\n" ] } ] }, { "cell_type": "markdown", "source": [ "Similar to how we fetched a dataset, we can fetch a table. In the code cell below, we fetch the `full` table in the `hacker_news` dataset." ], "metadata": { "id": "EOCsUwyEJA0K" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"full\" table\n", "table_ref = dataset_ref.table(\"full\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)" ], "metadata": { "id": "Q6JCjVCGI6ev", "execution": { "iopub.status.busy": "2022-04-23T03:34:26.126551Z", "iopub.execute_input": "2022-04-23T03:34:26.127256Z", "iopub.status.idle": "2022-04-23T03:34:26.318084Z", "shell.execute_reply.started": "2022-04-23T03:34:26.127219Z", "shell.execute_reply": "2022-04-23T03:34:26.317267Z" }, "trusted": true }, "execution_count": 9, "outputs": [] }, { "cell_type": "markdown", "source": [ "In the next section, you'll explore the contents of this table in more detail. 
For now, take the time to use the image below to consolidate what you've learned so far.\n", "\n", "![first_commands](https://i.imgur.com/biYqbUB.png)" ], "metadata": { "id": "EQ_-OP8cJtqo" } }, { "cell_type": "markdown", "source": [ "### Table schema\n", "\n", "The structure of a table is called its **schema**. **We need to understand a table's schema to effectively pull out the data we want.** \n", "\n", "In this example, we'll investigate the `full` table that we fetched above." ], "metadata": { "id": "RK2iKPr0J295" } }, { "cell_type": "code", "source": [ "# Print information on all the columns in the \"full\" table in the \"hacker_news\" dataset\n", "table.schema" ], "metadata": { "id": "oAcsp296JMPg", "execution": { "iopub.status.busy": "2022-04-23T03:34:27.440643Z", "iopub.execute_input": "2022-04-23T03:34:27.440936Z", "iopub.status.idle": "2022-04-23T03:34:27.448446Z", "shell.execute_reply.started": "2022-04-23T03:34:27.440908Z", "shell.execute_reply": "2022-04-23T03:34:27.447878Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "181ec15c-ad4d-472d-d147-ce134a101477" }, "execution_count": 10, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[SchemaField('title', 'STRING', 'NULLABLE', 'Story title', ()),\n", " SchemaField('url', 'STRING', 'NULLABLE', 'Story url', ()),\n", " SchemaField('text', 'STRING', 'NULLABLE', 'Story or comment text', ()),\n", " SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', ()),\n", " SchemaField('by', 'STRING', 'NULLABLE', \"The username of the item's author.\", ()),\n", " SchemaField('score', 'INTEGER', 'NULLABLE', 'Story score', ()),\n", " SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', ()),\n", " SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', 'Timestamp for the unix time', ()),\n", " SchemaField('type', 'STRING', 'NULLABLE', 'Type of details (comment, comment_ranking, poll, story, job, pollopt)', ()),\n", " SchemaField('id', 'INTEGER', 'NULLABLE', \"The 
item's unique id.\", ()),\n", " SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', ()),\n", " SchemaField('descendants', 'INTEGER', 'NULLABLE', 'Number of story or poll descendants', ()),\n", " SchemaField('ranking', 'INTEGER', 'NULLABLE', 'Comment ranking', ()),\n", " SchemaField('deleted', 'BOOLEAN', 'NULLABLE', 'Is deleted?', ())]" ] }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "markdown", "source": [ "Each [`SchemaField`](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.schema.SchemaField.html#google.cloud.bigquery.schema.SchemaField) tells us about a specific column (which we also refer to as a **field**). In order, the information is:\n", "\n", "* The **name** of the column\n", "* The **field type** (or datatype) in the column\n", "* The **mode** of the column (`'NULLABLE'` means that a column allows NULL values, and is the default)\n", "* A **description** of the data in that column\n", "\n", "For instance, the `by` field has the SchemaField:\n", "\n", "`SchemaField('by', 'STRING', 'NULLABLE', \"The username of the item's author.\", ())`\n", "\n", "This tells us:\n", "- the field (or column) is called `by`,\n", "- the data in this field is strings, \n", "- NULL values are allowed, and\n", "- it contains the usernames corresponding to each item's author.\n", "\n", "We can use the [`list_rows()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html?highlight=list_rows#google.cloud.bigquery.client.Client.list_rows) method to check just the first five rows of the `full` table to make sure this is right. 
This returns a BigQuery [`RowIterator`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.table.RowIterator.html?highlight=rowiterator#google.cloud.bigquery.table.RowIterator) object that can quickly be converted to a pandas DataFrame with the [`to_dataframe()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.table.RowIterator.html?highlight=to_dataframe#google.cloud.bigquery.table.RowIterator.to_dataframe) method." ], "metadata": { "id": "Pm6u6FfGKA8v" } }, { "cell_type": "code", "source": [ "# Preview the first five lines of the \"full\" table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "0eFyTIvFJ7ML", "execution": { "iopub.status.busy": "2022-04-23T03:34:28.459739Z", "iopub.execute_input": "2022-04-23T03:34:28.460370Z", "iopub.status.idle": "2022-04-23T03:34:29.255110Z", "shell.execute_reply.started": "2022-04-23T03:34:28.460329Z", "shell.execute_reply": "2022-04-23T03:34:29.254606Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/", "height": 577 }, "outputId": "f6e80114-32ca-40e8-8624-87c482e9f96f" }, "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title url text dead \\\n", "0 None None The corruption isn't about hurting or hel... None \n", "1 None None I had the choice of a M1 or a Thinkpad (which ... None \n", "2 None None Having a phone with you that you keep turned o... None \n", "3 None None I expect I'll get infected eventually. I&... None \n", "4 None None I am trying to find more information such as h... 
None \n", "\n", " by score time timestamp type \\\n", "0 azernik None 1642311281 2022-01-16 05:34:41+00:00 comment \n", "1 dopeboy None 1642311284 2022-01-16 05:34:44+00:00 comment \n", "2 dane-pgp None 1642311221 2022-01-16 05:33:41+00:00 comment \n", "3 simsla None 1642311228 2022-01-16 05:33:48+00:00 comment \n", "4 hamiltonians None 1642311234 2022-01-16 05:33:54+00:00 comment \n", "\n", " id parent descendants ranking deleted \n", "0 29953698 29951182 None None None \n", "1 29953699 29950651 None None None \n", "2 29953694 29953567 None None None \n", "3 29953695 29953363 None None None \n", "4 29953696 29953072 None None None " ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "markdown", "source": [ "The `list_rows()` method will also let us look at just the information in a specific column. If we want to see the first five entries in the `by` column, for example, we can do that!" 
], "metadata": { "id": "tfGNKgW3KxWW" } }, { "cell_type": "code", "source": [ "# Preview the first five entries in the \"by\" column of the \"full\" table\n", "client.list_rows(table, selected_fields=table.schema[4:5], max_results=5).to_dataframe()" ], "metadata": { "id": "ghYSN97rKc6f", "execution": { "iopub.status.busy": "2022-04-23T03:34:30.428333Z", "iopub.execute_input": "2022-04-23T03:34:30.428784Z", "iopub.status.idle": "2022-04-23T03:34:30.823825Z", "shell.execute_reply.started": "2022-04-23T03:34:30.428728Z", "shell.execute_reply": "2022-04-23T03:34:30.823228Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "c829a51b-8abe-4421-97b8-696145d9562f" }, "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " by\n", "0 azernik\n", "1 dopeboy\n", "2 dane-pgp\n", "3 simsla\n", "4 hamiltonians" ] }, "metadata": {}, "execution_count": 12 } ] }, { "cell_type": "markdown", "source": [ "### Select, From & Where" ], "metadata": { "id": "83w8rxkkM5TE" } }, { "cell_type": "markdown", "source": [ "Now that you know how to access and examine a dataset, you're ready to write your first SQL query! As you'll soon see, **SQL queries will help you sort through a massive dataset, to retrieve only the information that you need.** We'll begin by using the keywords **SELECT**, **FROM**, and **WHERE** to get data from specific columns based on conditions you specify. " ], "metadata": { "id": "wrRPIX4PMsKK" } }, { "cell_type": "markdown", "source": [ "We'll use an [OpenAQ](https://openaq.org) dataset about air quality. First, we'll set up everything we need to run queries and take a quick peek at what tables are in our database." 
], "metadata": { "id": "yZwYMG_FNm5S" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"openaq\" dataset\n", "dataset_ref = client.dataset(\"openaq\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# List all the tables in the \"openaq\" dataset\n", "tables = list(client.list_tables(dataset))\n", "\n", "# Print names of all tables in the dataset (there's only one!)\n", "for table in tables:\n", "    print(table.table_id)" ], "metadata": { "id": "72Zhkk0wK3oN", "execution": { "iopub.status.busy": "2022-04-23T03:34:34.103065Z", "iopub.execute_input": "2022-04-23T03:34:34.103476Z", "iopub.status.idle": "2022-04-23T03:34:34.569656Z", "shell.execute_reply.started": "2022-04-23T03:34:34.103437Z", "shell.execute_reply": "2022-04-23T03:34:34.568998Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "6ffdfcb1-d260-4312-f359-d18210d6ef58" }, "execution_count": 13, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "global_air_quality\n" ] } ] }, { "cell_type": "markdown", "source": [ "The dataset contains only one table, called `global_air_quality`. We'll fetch the table and take a peek at the first few rows to see what sort of data it contains." 
], "metadata": { "id": "x1czVLqAOEcs" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"global_air_quality\" table\n", "table_ref = dataset_ref.table(\"global_air_quality\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the \"global_air_quality\" table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "kYAuivzjOAM9", "execution": { "iopub.status.busy": "2022-04-23T03:34:36.196047Z", "iopub.execute_input": "2022-04-23T03:34:36.196746Z", "iopub.status.idle": "2022-04-23T03:34:36.908751Z", "shell.execute_reply.started": "2022-04-23T03:34:36.196712Z", "shell.execute_reply": "2022-04-23T03:34:36.907911Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "7ca3142f-93ab-42c5-82b5-13e8b15145fd" }, "execution_count": 14, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " location city country pollutant value timestamp \\\n", "0 MA Salt Lake City US no2 0.003 2020-01-06 18:00:00+00:00 \n", "1 MA Salt Lake City US so2 0.001 2020-01-06 18:00:00+00:00 \n", "2 MA Salt Lake City US o3 0.039 2020-01-06 18:00:00+00:00 \n", "3 MA Salt Lake City US pm25 1.300 2020-01-06 18:00:00+00:00 \n", "4 NR Salt Lake City US no2 0.013 2020-06-11 00:00:00+00:00 \n", "\n", " unit source_name latitude longitude averaged_over_in_hours \\\n", "0 ppm AirNow 40.712063 -112.111120 1.0 \n", "1 ppm AirNow 40.712063 -112.111120 1.0 \n", "2 ppm AirNow 40.712063 -112.111120 1.0 \n", "3 µg/m³ AirNow 40.712063 -112.111120 1.0 \n", "4 ppm AirNow 40.662840 -111.901794 1.0 \n", "\n", " location_geom \n", "0 POINT(-112.11112 40.712063) \n", "1 POINT(-112.11112 40.712063) \n", "2 POINT(-112.11112 40.712063) \n", "3 POINT(-112.11112 40.712063) \n", "4 POINT(-111.901794 40.66284) " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
locationcitycountrypollutantvaluetimestampunitsource_namelatitudelongitudeaveraged_over_in_hourslocation_geom
0MASalt Lake CityUSno20.0032020-01-06 18:00:00+00:00ppmAirNow40.712063-112.1111201.0POINT(-112.11112 40.712063)
1MASalt Lake CityUSso20.0012020-01-06 18:00:00+00:00ppmAirNow40.712063-112.1111201.0POINT(-112.11112 40.712063)
2MASalt Lake CityUSo30.0392020-01-06 18:00:00+00:00ppmAirNow40.712063-112.1111201.0POINT(-112.11112 40.712063)
3MASalt Lake CityUSpm251.3002020-01-06 18:00:00+00:00µg/m³AirNow40.712063-112.1111201.0POINT(-112.11112 40.712063)
4NRSalt Lake CityUSno20.0132020-06-11 00:00:00+00:00ppmAirNow40.662840-111.9017941.0POINT(-111.901794 40.66284)
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"MA\",\n\"Salt Lake City\",\n\"US\",\n\"no2\",\n{\n 'v': 0.003,\n 'f': \"0.003\",\n },\n\"2020-01-06 18:00:00+00:00\",\n\"ppm\",\n\"AirNow\",\n{\n 'v': 40.712063,\n 'f': \"40.712063\",\n },\n{\n 'v': -112.11112,\n 'f': \"-112.11112\",\n },\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"POINT(-112.11112 40.712063)\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"MA\",\n\"Salt Lake City\",\n\"US\",\n\"so2\",\n{\n 'v': 0.001,\n 'f': \"0.001\",\n },\n\"2020-01-06 18:00:00+00:00\",\n\"ppm\",\n\"AirNow\",\n{\n 'v': 40.712063,\n 'f': \"40.712063\",\n },\n{\n 'v': -112.11112,\n 'f': \"-112.11112\",\n },\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"POINT(-112.11112 40.712063)\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"MA\",\n\"Salt Lake City\",\n\"US\",\n\"o3\",\n{\n 'v': 0.039,\n 'f': \"0.039\",\n },\n\"2020-01-06 18:00:00+00:00\",\n\"ppm\",\n\"AirNow\",\n{\n 'v': 40.712063,\n 'f': \"40.712063\",\n },\n{\n 'v': -112.11112,\n 'f': \"-112.11112\",\n },\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"POINT(-112.11112 40.712063)\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"MA\",\n\"Salt Lake City\",\n\"US\",\n\"pm25\",\n{\n 'v': 1.3,\n 'f': \"1.3\",\n },\n\"2020-01-06 18:00:00+00:00\",\n\"\\u00b5g/m\\u00b3\",\n\"AirNow\",\n{\n 'v': 40.712063,\n 'f': \"40.712063\",\n },\n{\n 'v': -112.11112,\n 'f': \"-112.11112\",\n },\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"POINT(-112.11112 40.712063)\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"NR\",\n\"Salt Lake City\",\n\"US\",\n\"no2\",\n{\n 'v': 0.013,\n 'f': \"0.013\",\n },\n\"2020-06-11 00:00:00+00:00\",\n\"ppm\",\n\"AirNow\",\n{\n 'v': 40.66284,\n 'f': \"40.66284\",\n },\n{\n 'v': -111.901794,\n 'f': \"-111.901794\",\n },\n{\n 'v': 1.0,\n 'f': \"1.0\",\n },\n\"POINT(-111.901794 40.66284)\"]],\n columns: [[\"number\", \"index\"], [\"string\", 
\"location\"], [\"string\", \"city\"], [\"string\", \"country\"], [\"string\", \"pollutant\"], [\"number\", \"value\"], [\"string\", \"timestamp\"], [\"string\", \"unit\"], [\"string\", \"source_name\"], [\"number\", \"latitude\"], [\"number\", \"longitude\"], [\"number\", \"averaged_over_in_hours\"], [\"string\", \"location_geom\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "markdown", "source": [ "Let's put together a query. Say we want to select all the values from the `city` column that are in rows where the `country` column is `'US'` (for \"United States\")." ], "metadata": { "id": "1g2H5wNrOJrB" } }, { "cell_type": "code", "source": [ "# Query to select all the items from the \"city\" column where the \"country\" column is 'US'\n", "# SQL is almost completely insensitive to case and indentation. The capitalization and\n", "# indentation style used here is the preferred style.\n", "query = \"\"\"\n", " SELECT city\n", " FROM `bigquery-public-data.openaq.global_air_quality`\n", " WHERE country = 'US'\n", " \"\"\"" ], "metadata": { "id": "HNVL36G6OI9z", "execution": { "iopub.status.busy": "2022-04-23T03:34:37.669408Z", "iopub.execute_input": "2022-04-23T03:34:37.669919Z", "iopub.status.idle": "2022-04-23T03:34:37.674122Z", "shell.execute_reply.started": "2022-04-23T03:34:37.669880Z", "shell.execute_reply": "2022-04-23T03:34:37.673374Z" }, "trusted": true }, "execution_count": 15, "outputs": [] }, { "cell_type": "markdown", "source": [ "Notice also that the SQL string literal `'US'` is written with single quotes inside the Python string (which we delimit with triple quotation marks). 
We begin by setting up the query with the [`query()`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html?highlight=query#google.cloud.bigquery.client.Client.query) method." ], "metadata": { "id": "lrUVBJqmOfgD" } }, { "cell_type": "code", "source": [ "# Set up the query\n", "query_job = client.query(query)\n", "\n", "# API request - run the query, and return a pandas DataFrame\n", "us_cities = query_job.to_dataframe()" ], "metadata": { "id": "3P7iBjrQbxC5", "execution": { "iopub.status.busy": "2022-04-23T03:35:55.758822Z", "iopub.execute_input": "2022-04-23T03:35:55.759311Z", "iopub.status.idle": "2022-04-23T03:35:59.638985Z", "shell.execute_reply.started": "2022-04-23T03:35:55.759280Z", "shell.execute_reply": "2022-04-23T03:35:59.638236Z" }, "trusted": true }, "execution_count": 19, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now we've got a pandas DataFrame called `us_cities`, which we can use like any other DataFrame." ], "metadata": { "id": "28V2b3IMderS" } }, { "cell_type": "code", "source": [ "# What five cities have the most measurements?\n", "us_cities.city.value_counts().head()" ], "metadata": { "id": "GBNcJ4NtdeAm", "execution": { "iopub.status.busy": "2022-04-23T03:36:03.918735Z", "iopub.execute_input": "2022-04-23T03:36:03.919522Z", "iopub.status.idle": "2022-04-23T03:36:03.937814Z", "shell.execute_reply.started": "2022-04-23T03:36:03.919480Z", "shell.execute_reply": "2022-04-23T03:36:03.936847Z" }, "trusted": true, "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "703cd3cc-8759-4cdb-dba0-3db7a3d248b3" }, "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Phoenix-Mesa-Scottsdale 2147\n", "Riverside-San Bernardino-Ontario 2138\n", "Los Angeles-Long Beach-Santa Ana 1656\n", "New York-Northern New Jersey-Long Island 1433\n", "San Francisco-Oakland-Fremont 1337\n", "Name: city, dtype: int64" ] }, "metadata": {}, "execution_count": 17 } ] 
}, { "cell_type": "markdown", "source": [ "If you want multiple columns, separate their names with commas:" ], "metadata": { "id": "yDRyu1cR4fLU" } }, { "cell_type": "code", "source": [ "query = \"\"\"\n", " SELECT city, country\n", " FROM `bigquery-public-data.openaq.global_air_quality`\n", " WHERE country = 'US'\n", " \"\"\"" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:40:51.487848Z", "iopub.execute_input": "2022-04-23T03:40:51.488176Z", "iopub.status.idle": "2022-04-23T03:40:51.492074Z", "shell.execute_reply.started": "2022-04-23T03:40:51.488143Z", "shell.execute_reply": "2022-04-23T03:40:51.491286Z" }, "trusted": true, "id": "Ixf_qpmo4fLU" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "You can select all columns with a `*` like this:" ], "metadata": { "id": "3BNBGKPC4fLU" } }, { "cell_type": "code", "source": [ "query = \"\"\"\n", " SELECT *\n", " FROM `bigquery-public-data.openaq.global_air_quality`\n", " WHERE country = 'US'\n", " \"\"\"" ], "metadata": { "id": "LQ29UlAi4fLU" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Querying big datasets" ], "metadata": { "id": "w2hxfX8Q4fLV" } }, { "cell_type": "markdown", "source": [ "You can estimate the size of any query before running it. Here is an example using the Hacker News dataset. To see how much data a query will scan, we create a [`QueryJobConfig`](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJobConfig.html?highlight=queryjobconfig#google.cloud.bigquery.job.QueryJobConfig) object and set the `dry_run` parameter to `True`." 
], "metadata": { "id": "u0R_qYE64fLV" } }, { "cell_type": "code", "source": [ "# Query to get the score and title columns from every row where the type column has value \"job\"\n", "query = \"\"\"\n", " SELECT score, title\n", " FROM `bigquery-public-data.hacker_news.full`\n", " WHERE type = \"job\" \n", " \"\"\"\n", "\n", "# Create a QueryJobConfig object to estimate size of query without running it\n", "dry_run_config = bigquery.QueryJobConfig(dry_run=True)\n", "\n", "# API request - dry run query to estimate costs\n", "dry_run_query_job = client.query(query, job_config=dry_run_config)\n", "\n", "print(\"This query will process {} bytes.\".format(dry_run_query_job.total_bytes_processed))" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:43:43.636879Z", "iopub.execute_input": "2022-04-23T03:43:43.637619Z", "iopub.status.idle": "2022-04-23T03:43:44.083153Z", "shell.execute_reply.started": "2022-04-23T03:43:43.637561Z", "shell.execute_reply": "2022-04-23T03:43:44.082302Z" }, "trusted": true, "id": "wI2YIn_74fLV", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "1209729a-5595-4b2e-dd0e-59ba3c158f93" }, "execution_count": 20, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "This query will process 520947826 bytes.\n" ] } ] }, { "cell_type": "markdown", "source": [ "You can also pass the `maximum_bytes_billed` parameter when running the query to limit how much data you are willing to scan. Here's an example with a low limit." 
], "metadata": { "id": "nkAUYjlr4fLV" } }, { "cell_type": "code", "source": [ "# Only run the query if it's less than 1 MB\n", "ONE_MB = 1000*1000\n", "safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)\n", "\n", "# Set up the query (will only run if it's less than 1 MB)\n", "safe_query_job = client.query(query, job_config=safe_config)\n", "\n", "# API request - try to run the query, and return a pandas DataFrame\n", "safe_query_job.to_dataframe()" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:44:14.959744Z", "iopub.execute_input": "2022-04-23T03:44:14.960252Z", "iopub.status.idle": "2022-04-23T03:44:15.424113Z", "shell.execute_reply.started": "2022-04-23T03:44:14.960218Z", "shell.execute_reply": "2022-04-23T03:44:15.422855Z" }, "trusted": true, "id": "u2DeZlki4fLV", "colab": { "base_uri": "https://localhost:8080/", "height": 556 }, "outputId": "3a0ce9e7-ea73-46f0-d116-4945422bec40" }, "execution_count": 21, "outputs": [ { "output_type": "error", "ename": "InternalServerError", "evalue": "ignored", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mInternalServerError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;31m# API request - try to run the query, and return a pandas DataFrame\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0msafe_query_job\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_dataframe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/job.py\u001b[0m in \u001b[0;36mto_dataframe\u001b[0;34m(self, bqstorage_client, dtypes, progress_bar_type)\u001b[0m\n\u001b[1;32m 3103\u001b[0m 
\u001b[0mValueError\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mIf\u001b[0m \u001b[0mthe\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m`\u001b[0m\u001b[0mpandas\u001b[0m\u001b[0;31m`\u001b[0m \u001b[0mlibrary\u001b[0m \u001b[0mcannot\u001b[0m \u001b[0mbe\u001b[0m \u001b[0mimported\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3104\u001b[0m \"\"\"\n\u001b[0;32m-> 3105\u001b[0;31m return self.result().to_dataframe(\n\u001b[0m\u001b[1;32m 3106\u001b[0m \u001b[0mbqstorage_client\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbqstorage_client\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3107\u001b[0m \u001b[0mdtypes\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtypes\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/job.py\u001b[0m in \u001b[0;36mresult\u001b[0;34m(self, timeout, page_size, retry, max_results)\u001b[0m\n\u001b[1;32m 2972\u001b[0m \"\"\"\n\u001b[1;32m 2973\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2974\u001b[0;31m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mQueryJob\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2975\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2976\u001b[0m \u001b[0;31m# Return an iterator instead of returning the job.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/job.py\u001b[0m in \u001b[0;36mresult\u001b[0;34m(self, timeout, retry)\u001b[0m\n\u001b[1;32m 766\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_begin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mretry\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mretry\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 767\u001b[0m \u001b[0;31m# TODO: modify PollingFuture so it can pass a retry argument to done().\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 768\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_AsyncJob\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtimeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 769\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 770\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mcancelled\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/google/api_core/future/polling.py\u001b[0m in \u001b[0;36mresult\u001b[0;34m(self, timeout, retry)\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;31m# pylint: disable=raising-bad-type\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[0;31m# Pylint doesn't recognize that this is valid in this case.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 135\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_exception\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 136\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 137\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_result\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mInternalServerError\u001b[0m: 500 Query exceeded limit for bytes billed: 1000000. 521142272 or higher required.\n\n(job ID: 9ca85173-7397-408a-b6fc-204ffe97d7b5)\n\n -----Query Job SQL Follows----- \n\n | . | . | . | . | . |\n 1:\n 2: SELECT score, title\n 3: FROM `bigquery-public-data.hacker_news.full`\n 4: WHERE type = \"job\" \n 5: \n | . | . | . | . | . |" ] } ] }, { "cell_type": "markdown", "source": [ "In this case, the query was cancelled, because the limit of 1 MB was exceeded. However, we can also increase the limit to run the query successfully!" ], "metadata": { "id": "1Zs8R3_I4fLV" } }, { "cell_type": "markdown", "source": [ "### Group By, Having & Count" ], "metadata": { "id": "-H9JRY8H6y6Q" } }, { "cell_type": "markdown", "source": [ "Now that you can select raw data, you're ready to learn how to group your data and count things within those groups." ], "metadata": { "id": "vjHQ4EZe4fLV" } }, { "cell_type": "markdown", "source": [ "The Hacker News dataset contains information on stories and comments from the Hacker News social networking site. 
We'll work with the `comments` table and begin by printing the first few rows" ], "metadata": { "id": "jlxSqT5I4fLW" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"hacker_news\" dataset\n", "dataset_ref = client.dataset(\"hacker_news\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# Construct a reference to the \"comments\" table\n", "table_ref = dataset_ref.table(\"comments\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the \"comments\" table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:50:17.465447Z", "iopub.execute_input": "2022-04-23T03:50:17.465984Z", "iopub.status.idle": "2022-04-23T03:50:18.444486Z", "shell.execute_reply.started": "2022-04-23T03:50:17.465945Z", "shell.execute_reply": "2022-04-23T03:50:18.443937Z" }, "trusted": true, "id": "lbhZ1WrH4fLW", "colab": { "base_uri": "https://localhost:8080/", "height": 346 }, "outputId": "c305c389-6ff7-4328-dca3-0e8731440f48" }, "execution_count": 24, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " id by author time time_ts \\\n", "0 2701393 5l 5l 1309184881 2011-06-27 14:28:01+00:00 \n", "1 5811403 99 99 1370234048 2013-06-03 04:34:08+00:00 \n", "2 21623 AF AF 1178992400 2007-05-12 17:53:20+00:00 \n", "3 10159727 EA EA 1441206574 2015-09-02 15:09:34+00:00 \n", "4 2988424 Iv Iv 1315853580 2011-09-12 18:53:00+00:00 \n", "\n", " text parent deleted dead \\\n", "0 And the glazier who fixed all the broken windo... 2701243 None None \n", "1 Does canada have the equivalent of H1B/Green c... 5804452 None None \n", "2 Speaking of Rails, there are other options in ... 21611 None None \n", "3 Humans and large livestock (and maybe even pet... 
10159396 None None \n", "4 I must say I reacted in the same way when I re... 2988179 None None \n", "\n", " ranking \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ], "text/html": [ "\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 2701393,\n 'f': \"2701393\",\n },\n\"5l\",\n\"5l\",\n{\n 'v': 1309184881,\n 'f': \"1309184881\",\n },\n\"2011-06-27 14:28:01+00:00\",\n\"And the glazier who fixed all the broken windows also left his money to good causes.\",\n{\n 'v': 2701243,\n 'f': \"2701243\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 5811403,\n 'f': \"5811403\",\n },\n\"99\",\n\"99\",\n{\n 'v': 1370234048,\n 'f': \"1370234048\",\n },\n\"2013-06-03 04:34:08+00:00\",\n\"Does canada have the equivalent of H1B/Green card for work sponsorship? What do you think of that?\",\n{\n 'v': 5804452,\n 'f': \"5804452\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 21623,\n 'f': \"21623\",\n },\n\"AF\",\n\"AF\",\n{\n 'v': 1178992400,\n 'f': \"1178992400\",\n },\n\"2007-05-12 17:53:20+00:00\",\n\"Speaking of Rails, there are other options in the Python world besides Django.

Pylons is a very Rails-y framework with the difference being that it is made to be easy to customize. In Rails if you don't like something you are going to have a hard time changing it out unless you are a good hacker. In Pylons that is easy, and you've got access to Python's vastly better platform (speed, Unicode support) and libraries.

If you are an absolute beginning programmer it might be kind of hard to pick up, but if you've programmed a bit or you've used one or two web frameworks (especially Rails) Pylons won't be hard to learn.

http://pylonshq.com/<\\/a>\",\n{\n 'v': 21611,\n 'f': \"21611\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 10159727,\n 'f': \"10159727\",\n },\n\"EA\",\n\"EA\",\n{\n 'v': 1441206574,\n 'f': \"1441206574\",\n },\n\"2015-09-02 15:09:34+00:00\",\n\"Humans and large livestock (and maybe even pets) will have health monitoring devices embedded into their bodies in the near future. The devices will save the insurance companies money. Savings on insurance premiums will be the incentive to encourage mass adoption by citizens and owners of livestock.\",\n{\n 'v': 10159396,\n 'f': \"10159396\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 2988424,\n 'f': \"2988424\",\n },\n\"Iv\",\n\"Iv\",\n{\n 'v': 1315853580,\n 'f': \"1315853580\",\n },\n\"2011-09-12 18:53:00+00:00\",\n\"I must say I reacted in the same way when I read about Madoff. 
The fact that people who are supposed to inspect investments would fall for such a scheme was one of the first nails that was put in the esteem I had for economy specialists.\",\n{\n 'v': 2988179,\n 'f': \"2988179\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"id\"], [\"string\", \"by\"], [\"string\", \"author\"], [\"number\", \"time\"], [\"string\", \"time_ts\"], [\"string\", \"text\"], [\"number\", \"parent\"], [\"number\", \"deleted\"], [\"number\", \"dead\"], [\"number\", \"ranking\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 24 } ] }, { "cell_type": "markdown", "source": [ "Let's use the table to see which comments generated the most replies. Since:\n", "- the `parent` column indicates the comment that was replied to, and \n", "- the `id` column has the unique ID used to identify each comment, \n", "\n", "we can **GROUP BY** the `parent` column and **COUNT()** the `id` column in order to figure out the number of comments that were made as responses to a specific comment.\n", "\n", "Furthermore, since we're only interested in popular comments, we'll look at comments with more than ten replies. So, we'll only return groups **HAVING** more than ten IDs." 
], "metadata": { "id": "5xQZeTKP4fLW" } }, { "cell_type": "code", "source": [ "# Query to select comments that received more than 10 replies\n", "query_popular = \"\"\"\n", " SELECT parent, COUNT(id)\n", " FROM `bigquery-public-data.hacker_news.comments`\n", " GROUP BY parent\n", " HAVING COUNT(id) > 10\n", " \"\"\"" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:51:01.675659Z", "iopub.execute_input": "2022-04-23T03:51:01.675922Z", "iopub.status.idle": "2022-04-23T03:51:01.682999Z", "shell.execute_reply.started": "2022-04-23T03:51:01.675896Z", "shell.execute_reply": "2022-04-23T03:51:01.682129Z" }, "trusted": true, "id": "GTGRV8GE4fLW" }, "execution_count": 25, "outputs": [] }, { "cell_type": "code", "source": [ "# Set up the query (cancel the query if it would use too much of \n", "# your quota, with the limit set to 10 GB)\n", "safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)\n", "query_job = client.query(query_popular, job_config=safe_config)\n", "\n", "# API request - run the query, and convert the results to a pandas DataFrame\n", "popular_comments = query_job.to_dataframe()\n", "\n", "# Print the first five rows of the DataFrame\n", "popular_comments.head()" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:51:25.840649Z", "iopub.execute_input": "2022-04-23T03:51:25.841060Z", "iopub.status.idle": "2022-04-23T03:51:32.497782Z", "shell.execute_reply.started": "2022-04-23T03:51:25.841031Z", "shell.execute_reply": "2022-04-23T03:51:32.496981Z" }, "trusted": true, "id": "Wb7hgSV24fLW", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "99505207-0fd9-4195-ef19-61cee55ad4c1" }, "execution_count": 29, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " parent f0_\n", "0 4332978 53\n", "1 2970550 63\n", "2 3353593 68\n", "3 3734303 56\n", "4 5048699 61" ], "text/html": [ "\n", "

\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 4332978,\n 'f': \"4332978\",\n },\n{\n 'v': 53,\n 'f': \"53\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 2970550,\n 'f': \"2970550\",\n },\n{\n 'v': 63,\n 'f': \"63\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 3353593,\n 'f': \"3353593\",\n },\n{\n 'v': 68,\n 'f': \"68\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 3734303,\n 'f': \"3734303\",\n },\n{\n 'v': 56,\n 'f': \"56\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 5048699,\n 'f': \"5048699\",\n },\n{\n 'v': 61,\n 'f': \"61\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"parent\"], [\"number\", \"f0_\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 29 } ] }, { "cell_type": "code", "source": [ "popular_comments" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "id": "TVI78T52tigN", "outputId": "5d33a14c-5412-4678-a52a-7f6f863d07f7" }, "execution_count": 30, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Warning: total number of rows (77368) exceeds max_rows (20000). Falling back to pandas display.\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " parent f0_\n", "0 4332978 53\n", "1 2970550 63\n", "2 3353593 68\n", "3 3734303 56\n", "4 5048699 61\n", "... ... ...\n", "77363 1659020 37\n", "77364 10180728 37\n", "77365 9751539 37\n", "77366 7978163 37\n", "77367 4417571 37\n", "\n", "[77368 rows x 2 columns]" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 30 } ] }, { "cell_type": "markdown", "source": [ "Each row in the `popular_comments` DataFrame corresponds to a comment that received more than ten replies. For instance, the comment with ID `4332978` received `53` replies." ], "metadata": { "id": "j1eXKXzN4fLX" } }, { "cell_type": "markdown", "source": [ "A couple of hints to make your queries even better:\n", "- The column resulting from `COUNT(id)` was called `f0_`. That's not a very descriptive name. You can change the name by adding `AS NumPosts` after you specify the aggregation. This is called **aliasing**.\n", "- If you are ever unsure what to put inside the **COUNT()** function, you can do `COUNT(1)` to count the rows in each group. Most people find it especially readable, because we know it's not focusing on other columns. It also scans less data than supplying a column name would (making it faster and using less of your data access quota).\n", "\n", "Using these tricks, we can rewrite our query:" ], "metadata": { "id": "UVvf6W2c9LbB" } }, { "cell_type": "code", "source": [ "# Improved version of earlier query, now with aliasing & improved readability\n", "query_improved = \"\"\"\n", " SELECT parent, COUNT(1) AS NumPosts\n", " FROM `bigquery-public-data.hacker_news.comments`\n", " GROUP BY parent\n", " HAVING COUNT(1) > 10\n", " \"\"\"\n", "\n", "safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)\n", "query_job = client.query(query_improved, job_config=safe_config)\n", "\n", "# API request - run the query, and convert the results to a pandas DataFrame\n", "improved_df = query_job.to_dataframe()\n", "\n", "# Print the first five rows of the DataFrame\n", "improved_df.head()" ], "metadata": { "execution": { "iopub.status.busy": "2022-04-23T03:53:18.501704Z", "iopub.execute_input": "2022-04-23T03:53:18.501994Z", "iopub.status.idle": "2022-04-23T03:53:24.596965Z", "shell.execute_reply.started": "2022-04-23T03:53:18.501962Z", "shell.execute_reply": 
"2022-04-23T03:53:24.596042Z" }, "trusted": true, "id": "YCqkWvK_4fLX", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "90487304-d2e2-477f-de7e-26dd77fba79b" }, "execution_count": 27, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " parent NumPosts\n", "0 6683866 39\n", "1 6627329 46\n", "2 3476843 49\n", "3 7234010 48\n", "4 2932956 76" ], "text/html": [ "\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 6683866,\n 'f': \"6683866\",\n },\n{\n 'v': 39,\n 'f': \"39\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 6627329,\n 'f': \"6627329\",\n },\n{\n 'v': 46,\n 'f': \"46\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 3476843,\n 'f': \"3476843\",\n },\n{\n 'v': 49,\n 'f': \"49\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 7234010,\n 'f': \"7234010\",\n },\n{\n 'v': 48,\n 'f': \"48\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 2932956,\n 'f': \"2932956\",\n },\n{\n 'v': 76,\n 'f': \"76\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"parent\"], [\"number\", \"NumPosts\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 27 } ] }, { "cell_type": "code", "source": [ "improved_df" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "id": "e4yVjvH2tcBf", "outputId": "a32bf803-f8ba-4e73-c4d5-7d34be1e4266" }, "execution_count": 28, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Warning: total number of rows (77368) exceeds max_rows (20000). Falling back to pandas display.\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " parent NumPosts\n", "0 6683866 39\n", "1 6627329 46\n", "2 3476843 49\n", "3 7234010 48\n", "4 2932956 76\n", "... ... ...\n", "77363 2873865 37\n", "77364 6971290 37\n", "77365 8793579 37\n", "77366 6937686 37\n", "77367 412772 37\n", "\n", "[77368 rows x 2 columns]" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 28 } ] }, { "cell_type": "markdown", "source": [ "Now you have the data you want, and it has descriptive names. \n", "\n", "#### Note on using **GROUP BY**\n", "\n", "Because **GROUP BY** tells SQL how to apply aggregate functions (like **COUNT()**), it doesn't make sense to use **GROUP BY** without an aggregate function. Similarly, if you have any **GROUP BY** clause, then every column in the **SELECT** list must be passed to either\n", "1. a **GROUP BY** command, or\n", "2. an aggregation function.\n", "\n", "Consider the query below:\n", "\n" ], "metadata": { "id": "hNf8IMK_4fLX" } }, { "cell_type": "code", "source": [ "query_good = \"\"\"\n", " SELECT parent, COUNT(id)\n", " FROM `bigquery-public-data.hacker_news.comments`\n", " GROUP BY parent\n", " \"\"\"" ], "metadata": { "id": "dTzcM2C04fLX" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Note that there are two variables: `parent` and `id`. \n", "- `parent` was passed to a **GROUP BY** command (in `GROUP BY parent`), and \n", "- `id` was passed to an aggregate function (in `COUNT(id)`).\n", "\n", "By contrast, the following query won't work, because the `author` column isn't passed to an aggregate function or a **GROUP BY** clause:" ], "metadata": { "id": "oOpimGND4fLX" } }, { "cell_type": "code", "source": [ "query_bad = \"\"\"\n", " SELECT author, parent, COUNT(id)\n", " FROM `bigquery-public-data.hacker_news.comments`\n", " GROUP BY parent\n", " \"\"\"" ], "metadata": { "id": "57X5jB1J4fLX" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Order By" ], "metadata": { "id": "RYy0qPJ89w6r" } }, { "cell_type": "markdown", "source": [ "Frequently, you’ll want to sort your results. The **ORDER BY** clause does this: it usually comes last in a query and sorts the returned rows by the column you name. **Let's use the US Traffic Fatality Records database, which contains information on traffic accidents in the US where at least one person died.**\n", "\n", "We'll investigate the `accident_2015` table. Here is a view of the first few rows. 
" ], "metadata": { "id": "T7euCJRKBCxu" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"nhtsa_traffic_fatalities\" dataset\n", "dataset_ref = client.dataset(\"nhtsa_traffic_fatalities\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# Construct a reference to the \"accident_2015\" table\n", "table_ref = dataset_ref.table(\"accident_2015\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the \"accident_2015\" table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "JpqOyftp9wVT", "colab": { "base_uri": "https://localhost:8080/", "height": 404 }, "outputId": "3004bd39-00c3-47ec-9431-134479707ad7" }, "execution_count": 31, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Warning: Total number of columns (70) exceeds max_columns (20). Falling back to pandas display.\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " state_number state_name consecutive_number \\\n", "0 19 Iowa 190204 \n", "1 19 Iowa 190233 \n", "2 19 Iowa 190179 \n", "3 19 Iowa 190248 \n", "4 19 Iowa 190231 \n", "\n", " number_of_vehicle_forms_submitted_all \\\n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "\n", " number_of_motor_vehicles_in_transport_mvit \\\n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 \n", "\n", " number_of_parked_working_vehicles \\\n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", " number_of_forms_submitted_for_persons_not_in_motor_vehicles \\\n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", " number_of_persons_not_in_motor_vehicles_in_transport_mvit \\\n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", " number_of_persons_in_motor_vehicles_in_transport_mvit \\\n", "0 1 \n", "1 1 \n", "2 2 \n", "3 4 \n", "4 1 \n", "\n", " 
number_of_forms_submitted_for_persons_in_motor_vehicles ... \\\n", "0 1 ... \n", "1 1 ... \n", "2 2 ... \n", "3 4 ... \n", "4 1 ... \n", "\n", " minute_of_ems_arrival_at_hospital related_factors_crash_level_1 \\\n", "0 2 0 \n", "1 88 0 \n", "2 1 0 \n", "3 99 0 \n", "4 88 0 \n", "\n", " related_factors_crash_level_1_name related_factors_crash_level_2 \\\n", "0 None 0 \n", "1 None 0 \n", "2 None 0 \n", "3 None 0 \n", "4 None 0 \n", "\n", " related_factors_crash_level_2_name related_factors_crash_level_3 \\\n", "0 None 0 \n", "1 None 0 \n", "2 None 0 \n", "3 None 0 \n", "4 None 0 \n", "\n", " related_factors_crash_level_3_name number_of_fatalities \\\n", "0 None 1 \n", "1 None 1 \n", "2 None 1 \n", "3 None 2 \n", "4 None 1 \n", "\n", " number_of_drunk_drivers timestamp_of_crash \n", "0 1 2015-09-11 20:20:00+00:00 \n", "1 1 2015-11-01 00:30:00+00:00 \n", "2 0 2015-05-04 16:18:00+00:00 \n", "3 0 2015-11-17 12:26:00+00:00 \n", "4 0 2015-10-31 04:49:00+00:00 \n", "\n", "[5 rows x 70 columns]" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 31 } ] }, { "cell_type": "markdown", "source": [ "Let's use the table to determine how the number of accidents varies with the day of the week. Since:\n", "- the `consecutive_number` column contains a unique ID for each accident, and\n", "- the `timestamp_of_crash` column contains the date and time of the accident as a timestamp (the [date functions documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions) describes the parts you can extract from it),\n", "\n", "we can:\n", "- **EXTRACT** the day of the week (as `day_of_week` in the query below) from the `timestamp_of_crash` column, and\n", "- **GROUP BY** the day of the week, before we **COUNT** the `consecutive_number` column to determine the number of accidents for each day of the week.\n", "\n", "Then we sort the table with an **ORDER BY** clause, so the days with the most accidents are returned first." ], "metadata": { "id": "UUJ2fuKmBSc1" } }, { "cell_type": "code", "source": [ "# Query to find out the number of accidents for each day of the week\n", "query = \"\"\"\n", " SELECT COUNT(consecutive_number) AS num_accidents, \n", " EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week\n", " FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`\n", " GROUP BY day_of_week\n", " ORDER BY num_accidents DESC\n", " \"\"\"" ], "metadata": { "id": "4I_8JldMBc04" }, "execution_count": 32, "outputs": [] }, { "cell_type": "code", "source": [ "# Set up the query (cancel the query if it would use too much of \n", "# your quota, with the limit set to 1 GB)\n", "safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)\n", "query_job = client.query(query, job_config=safe_config)\n", "\n", "# API request - run the query, and convert the results to a pandas DataFrame\n", "accidents_by_day = query_job.to_dataframe()\n", "\n", "# Print the DataFrame\n", "accidents_by_day" ], "metadata": { "id": "B6Dks4_hBrLv", "colab": { "base_uri": "https://localhost:8080/", "height": 240 }, "outputId": 
"6e6a401e-5e0d-4e45-81a9-478b6b676f6a" }, "execution_count": 33, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " num_accidents day_of_week\n", "0 5659 7\n", "1 5298 1\n", "2 4916 6\n", "3 4460 5\n", "4 4182 4\n", "5 4038 2\n", "6 3985 3" ], "text/html": [ "\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 5659,\n 'f': \"5659\",\n },\n{\n 'v': 7,\n 'f': \"7\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 5298,\n 'f': \"5298\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 4916,\n 'f': \"4916\",\n },\n{\n 'v': 6,\n 'f': \"6\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 4460,\n 'f': \"4460\",\n },\n{\n 'v': 5,\n 'f': \"5\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 4182,\n 'f': \"4182\",\n },\n{\n 'v': 4,\n 'f': \"4\",\n }],\n [{\n 'v': 5,\n 'f': \"5\",\n },\n{\n 'v': 4038,\n 'f': \"4038\",\n },\n{\n 'v': 2,\n 'f': \"2\",\n }],\n [{\n 'v': 6,\n 'f': \"6\",\n },\n{\n 'v': 3985,\n 'f': \"3985\",\n },\n{\n 'v': 3,\n 'f': \"3\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"num_accidents\"], [\"number\", \"day_of_week\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 33 } ] }, { "cell_type": "markdown", "source": [ "Notice that the data is sorted by the `num_accidents` column, where the days with more traffic accidents appear first.\n", "\n", "To map the numbers returned for the `day_of_week` column to the actual day, you might consult [the BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions) on the DAYOFWEEK date part. It says that it returns \"an integer between 1 (Sunday) and 7 (Saturday), inclusively\". So, in 2015, the most fatal motor accidents in the US occurred on Saturday (7) and Sunday (1), while the fewest happened on Tuesday (3)." 
], "metadata": { "id": "vQ6usjVXBt4z" } }, { "cell_type": "markdown", "source": [ "### As and With" ], "metadata": { "id": "gC3QTzVCI6ST" } }, { "cell_type": "markdown", "source": [ "On its own, `AS` is a convenient way to clean up the data returned by your query. It becomes even more powerful when combined with `WITH` in a common table expression (CTE): a temporary, named result that you define inside a query and can then select from like a table. **We're going to use a CTE** to find out **how many Bitcoin transactions were made each day for the entire timespan of a Bitcoin transaction dataset.**\n", "\n", "We'll investigate the `transactions` table. Here is a view of the first few rows." ], "metadata": { "id": "gsAQF0yII_2j" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"crypto_bitcoin\" dataset\n", "dataset_ref = client.dataset(\"crypto_bitcoin\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# Construct a reference to the \"transactions\" table\n", "table_ref = dataset_ref.table(\"transactions\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the \"transactions\" table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "PN9j763YI-rz", "colab": { "base_uri": "https://localhost:8080/", "height": 875 }, "outputId": "4e654ebb-f984-4c3f-846d-03a6c8920c6c" }, "execution_count": 34, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " hash size virtual_size \\\n", "0 a16f3ce4dd5deb92d98ef5cf8afeaf0775ebca408f708b... 275 275 \n", "1 591e91f809d716912ca1d4a9295e70c3e78bab077683f7... 275 275 \n", "2 12b5633bad1f9c167d523ad1aa1947b2732a865bf5414e... 276 276 \n", "3 828ef3b079f9c23829c56fe86e85b4a69d9e06e5b54ea5... 276 276 \n", "4 35288d269cee1941eaebb2ea85e32b42cdb2b04284a56d... 277 277 \n", "\n", " version lock_time block_hash \\\n", "0 1 0 00000000dc55860c8a29c58d45209318fa9e9dc2c1833a... \n", "1 1 0 0000000054487811fc4ff7a95be738aa5ad9320c394c48... 
\n", "2 1 0 00000000f46e513f038baf6f2d9a95b2a28d8a6c985bcf... \n", "3 1 0 00000000fb5b44edc7a1aa105075564a179d65506e2bd2... \n", "4 1 0 00000000689051c09ff2cd091cc4c22c10b965eb8db3ad... \n", "\n", " block_number block_timestamp block_timestamp_month input_count \\\n", "0 181 2009-01-12 06:02:13+00:00 2009-01-01 1 \n", "1 182 2009-01-12 06:12:16+00:00 2009-01-01 1 \n", "2 183 2009-01-12 06:34:22+00:00 2009-01-01 1 \n", "3 248 2009-01-12 20:04:20+00:00 2009-01-01 1 \n", "4 545 2009-01-15 05:48:32+00:00 2009-01-01 1 \n", "\n", " output_count input_value output_value is_coinbase fee \\\n", "0 2 4000000000 4000000000 False 0 \n", "1 2 3000000000 3000000000 False 0 \n", "2 2 2900000000 2900000000 False 0 \n", "3 2 2800000000 2800000000 False 0 \n", "4 2 2500000000 2500000000 False 0 \n", "\n", " inputs \\\n", "0 [{'index': 0, 'spent_transaction_hash': 'f4184... \n", "1 [{'index': 0, 'spent_transaction_hash': 'a16f3... \n", "2 [{'index': 0, 'spent_transaction_hash': '591e9... \n", "3 [{'index': 0, 'spent_transaction_hash': '12b56... \n", "4 [{'index': 0, 'spent_transaction_hash': 'd71fd... \n", "\n", " outputs \n", "0 [{'index': 0, 'script_asm': '04b5abd412d4341b4... \n", "1 [{'index': 0, 'script_asm': '0401518fa1d1e1e3e... \n", "2 [{'index': 0, 'script_asm': '04baa9d3665315562... \n", "3 [{'index': 0, 'script_asm': '04bed827d37474bef... \n", "4 [{'index': 0, 'script_asm': '044a656f065871a35... " ], "text/html": [ "\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"a16f3ce4dd5deb92d98ef5cf8afeaf0775ebca408f708b2146c4fb42b41e14be\",\n{\n 'v': 275,\n 'f': \"275\",\n },\n{\n 'v': 275,\n 'f': \"275\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n },\n\"00000000dc55860c8a29c58d45209318fa9e9dc2c1833a7226d86bc465afc6e5\",\n{\n 'v': 181,\n 'f': \"181\",\n },\n\"2009-01-12 06:02:13+00:00\",\n\"2009-01-01\",\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': \"4000000000\",\n 'f': \"\\\"4000000000\\\"\",\n },\n{\n 'v': \"4000000000\",\n 'f': \"\\\"4000000000\\\"\",\n },\nfalse,\n{\n 'v': \"0\",\n 'f': \"\\\"0\\\"\",\n },\n[\"{'index': 0, 'spent_transaction_hash': 'f4184fc596403b9d638783cf57adfe4c75c605f6356fbc91338530e9831e9e16', 'spent_output_index': 1, 'script_asm': '3044022027542a94d6646c51240f23a76d33088d3dd8815b25e9ea18cac67d1171a3212e02203baf203c6e7b80ebd3e588628466ea28be572fe1aaa3f30947da4763dd3b3d2b[ALL]', 'script_hex': '473044022027542a94d6646c51240f23a76d33088d3dd8815b25e9ea18cac67d1171a3212e02203baf203c6e7b80ebd3e588628466ea28be572fe1aaa3f30947da4763dd3b3d2b01', 'sequence': 4294967295, 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('4000000000')}\"],\n[\"{'index': 0, 'script_asm': '04b5abd412d4341b45056d3e376cd446eca43fa871b51961330deebd84423e740daa520690e1d9e074654c59ff87b408db903649623e86f1ca5412786f61ade2bf OP_CHECKSIG', 'script_hex': '4104b5abd412d4341b45056d3e376cd446eca43fa871b51961330deebd84423e740daa520690e1d9e074654c59ff87b408db903649623e86f1ca5412786f61ade2bfac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['1DUDsfc23Dv9sPMEk5RsrtfzCw5ofi5sVW'], 'value': Decimal('1000000000')}\", \"{'index': 1, 'script_asm': 
'0411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3 OP_CHECKSIG', 'script_hex': '410411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('3000000000')}\"]],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"591e91f809d716912ca1d4a9295e70c3e78bab077683f79350f101da64588073\",\n{\n 'v': 275,\n 'f': \"275\",\n },\n{\n 'v': 275,\n 'f': \"275\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n },\n\"0000000054487811fc4ff7a95be738aa5ad9320c394c482b27c0da28b227ad5d\",\n{\n 'v': 182,\n 'f': \"182\",\n },\n\"2009-01-12 06:12:16+00:00\",\n\"2009-01-01\",\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': \"3000000000\",\n 'f': \"\\\"3000000000\\\"\",\n },\n{\n 'v': \"3000000000\",\n 'f': \"\\\"3000000000\\\"\",\n },\nfalse,\n{\n 'v': \"0\",\n 'f': \"\\\"0\\\"\",\n },\n[\"{'index': 0, 'spent_transaction_hash': 'a16f3ce4dd5deb92d98ef5cf8afeaf0775ebca408f708b2146c4fb42b41e14be', 'spent_output_index': 1, 'script_asm': '304402201f27e51caeb9a0988a1e50799ff0af94a3902403c3ad4068b063e7b4d1b0a76702206713f69bd344058b0dee55a9798759092d0916dbbc3e592fee43060005ddc174[ALL]', 'script_hex': '47304402201f27e51caeb9a0988a1e50799ff0af94a3902403c3ad4068b063e7b4d1b0a76702206713f69bd344058b0dee55a9798759092d0916dbbc3e592fee43060005ddc17401', 'sequence': 4294967295, 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('3000000000')}\"],\n[\"{'index': 0, 'script_asm': '0401518fa1d1e1e3e162852d68d9be1c0abad5e3d6297ec95f1f91b909dc1afe616d6876f92918451ca387c4387609ae1a895007096195a824baf9c38ea98c09c3 OP_CHECKSIG', 'script_hex': 
'410401518fa1d1e1e3e162852d68d9be1c0abad5e3d6297ec95f1f91b909dc1afe616d6876f92918451ca387c4387609ae1a895007096195a824baf9c38ea98c09c3ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['1LzBzVqEeuQyjD2mRWHes3dgWrT9titxvq'], 'value': Decimal('100000000')}\", \"{'index': 1, 'script_asm': '0411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3 OP_CHECKSIG', 'script_hex': '410411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('2900000000')}\"]],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"12b5633bad1f9c167d523ad1aa1947b2732a865bf5414eab2f9e5ae5d5c191ba\",\n{\n 'v': 276,\n 'f': \"276\",\n },\n{\n 'v': 276,\n 'f': \"276\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n },\n\"00000000f46e513f038baf6f2d9a95b2a28d8a6c985bcf24b9e07f0f63a29888\",\n{\n 'v': 183,\n 'f': \"183\",\n },\n\"2009-01-12 06:34:22+00:00\",\n\"2009-01-01\",\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': \"2900000000\",\n 'f': \"\\\"2900000000\\\"\",\n },\n{\n 'v': \"2900000000\",\n 'f': \"\\\"2900000000\\\"\",\n },\nfalse,\n{\n 'v': \"0\",\n 'f': \"\\\"0\\\"\",\n },\n[\"{'index': 0, 'spent_transaction_hash': '591e91f809d716912ca1d4a9295e70c3e78bab077683f79350f101da64588073', 'spent_output_index': 1, 'script_asm': '3045022052ffc1929a2d8bd365c6a2a4e3421711b4b1e1b8781698ca9075807b4227abcb0221009984107ddb9e3813782b095d0d84361ed4c76e5edaf6561d252ae162c2341cfb[ALL]', 'script_hex': '483045022052ffc1929a2d8bd365c6a2a4e3421711b4b1e1b8781698ca9075807b4227abcb0221009984107ddb9e3813782b095d0d84361ed4c76e5edaf6561d252ae162c2341cfb01', 'sequence': 4294967295, 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': 
Decimal('2900000000')}\"],\n[\"{'index': 0, 'script_asm': '04baa9d36653155627c740b3409a734d4eaf5dcca9fb4f736622ee18efcf0aec2b758b2ec40db18fbae708f691edb2d4a2a3775eb413d16e2e3c0f8d4c69119fd1 OP_CHECKSIG', 'script_hex': '4104baa9d36653155627c740b3409a734d4eaf5dcca9fb4f736622ee18efcf0aec2b758b2ec40db18fbae708f691edb2d4a2a3775eb413d16e2e3c0f8d4c69119fd1ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['13HtsYzne8xVPdGDnmJX8gHgBZerAfJGEf'], 'value': Decimal('100000000')}\", \"{'index': 1, 'script_asm': '0411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3 OP_CHECKSIG', 'script_hex': '410411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('2800000000')}\"]],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"828ef3b079f9c23829c56fe86e85b4a69d9e06e5b54ea597eef5fb3ffef509fe\",\n{\n 'v': 276,\n 'f': \"276\",\n },\n{\n 'v': 276,\n 'f': \"276\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n },\n\"00000000fb5b44edc7a1aa105075564a179d65506e2bd25f55f1629251d0f6b0\",\n{\n 'v': 248,\n 'f': \"248\",\n },\n\"2009-01-12 20:04:20+00:00\",\n\"2009-01-01\",\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': \"2800000000\",\n 'f': \"\\\"2800000000\\\"\",\n },\n{\n 'v': \"2800000000\",\n 'f': \"\\\"2800000000\\\"\",\n },\nfalse,\n{\n 'v': \"0\",\n 'f': \"\\\"0\\\"\",\n },\n[\"{'index': 0, 'spent_transaction_hash': '12b5633bad1f9c167d523ad1aa1947b2732a865bf5414eab2f9e5ae5d5c191ba', 'spent_output_index': 1, 'script_asm': '3045022100c12a7d54972f26d14cb311339b5122f8c187417dde1e8efb6841f55c34220ae0022066632c5cd4161efa3a2837764eee9eb84975dd54c2de2865e9752585c53e7cce[ALL]', 'script_hex': 
'483045022100c12a7d54972f26d14cb311339b5122f8c187417dde1e8efb6841f55c34220ae0022066632c5cd4161efa3a2837764eee9eb84975dd54c2de2865e9752585c53e7cce01', 'sequence': 4294967295, 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('2800000000')}\"],\n[\"{'index': 0, 'script_asm': '04bed827d37474beffb37efe533701ac1f7c600957a4487be8b371346f016826ee6f57ba30d88a472a0e4ecd2f07599a795f1f01de78d791b382e65ee1c58b4508 OP_CHECKSIG', 'script_hex': '4104bed827d37474beffb37efe533701ac1f7c600957a4487be8b371346f016826ee6f57ba30d88a472a0e4ecd2f07599a795f1f01de78d791b382e65ee1c58b4508ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['1ByLSV2gLRcuqUmfdYcpPQH8Npm8cccsFg'], 'value': Decimal('1000000000')}\", \"{'index': 1, 'script_asm': '0411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3 OP_CHECKSIG', 'script_hex': '410411db93e1dcdb8a016b49840f8c53bc1eb68a382e97b1482ecad7b148a6909a5cb2e0eaddfb84ccf9744464f82e160bfa9b8b64f9d4c03f999b8643f656b412a3ac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['12cbQLTFMXRnSzktFkuoG3eHoMeFtpTu3S'], 'value': Decimal('1800000000')}\"]],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"35288d269cee1941eaebb2ea85e32b42cdb2b04284a56d8b14dcc3f5c65d6055\",\n{\n 'v': 277,\n 'f': \"277\",\n },\n{\n 'v': 277,\n 'f': \"277\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n },\n\"00000000689051c09ff2cd091cc4c22c10b965eb8db3ad5f032621cc36626175\",\n{\n 'v': 545,\n 'f': \"545\",\n },\n\"2009-01-15 05:48:32+00:00\",\n\"2009-01-01\",\n{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': \"2500000000\",\n 'f': \"\\\"2500000000\\\"\",\n },\n{\n 'v': \"2500000000\",\n 'f': \"\\\"2500000000\\\"\",\n },\nfalse,\n{\n 'v': \"0\",\n 'f': \"\\\"0\\\"\",\n },\n[\"{'index': 0, 'spent_transaction_hash': 'd71fd2f64c0b34465b7518d240c00e83f6a5b10138a7079d1252858fe7e6b577', 'spent_output_index': 
0, 'script_asm': '304602210083ec8bd391269f00f3d714a54f4dbd6b8004b3e9c91f3078ff4fca42da456f4d0221008dfe1450870a717f59a494b77b36b7884381233555f8439dac4ea969977dd3f4[ALL]', 'script_hex': '49304602210083ec8bd391269f00f3d714a54f4dbd6b8004b3e9c91f3078ff4fca42da456f4d0221008dfe1450870a717f59a494b77b36b7884381233555f8439dac4ea969977dd3f401', 'sequence': 4294967295, 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['1DCbY2GYVaAMCBpuBNN5GVg3a47pNK1wdi'], 'value': Decimal('2500000000')}\"],\n[\"{'index': 0, 'script_asm': '044a656f065871a353f216ca26cef8dde2f03e8c16202d2e8ad769f02032cb86a5eb5e56842e92e19141d60a01928f8dd2c875a390f67c1f6c94cfc617c0ea45af OP_CHECKSIG', 'script_hex': '41044a656f065871a353f216ca26cef8dde2f03e8c16202d2e8ad769f02032cb86a5eb5e56842e92e19141d60a01928f8dd2c875a390f67c1f6c94cfc617c0ea45afac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['1DZTzaBHUDM7T3QvUKBz4qXMRpkg8jsfB5'], 'value': Decimal('100000000')}\", \"{'index': 1, 'script_asm': '04f36c67039006ec4ed2c885d7ab0763feb5deb9633cf63841474712e4cf0459356750185fc9d962d0f4a1e08e1a84f0c9a9f826ad067675403c19d752530492dc OP_CHECKSIG', 'script_hex': '4104f36c67039006ec4ed2c885d7ab0763feb5deb9633cf63841474712e4cf0459356750185fc9d962d0f4a1e08e1a84f0c9a9f826ad067675403c19d752530492dcac', 'required_signatures': 1, 'type': 'pubkey', 'addresses': ['1DCbY2GYVaAMCBpuBNN5GVg3a47pNK1wdi'], 'value': Decimal('2400000000')}\"]]],\n columns: [[\"number\", \"index\"], [\"string\", \"hash\"], [\"number\", \"size\"], [\"number\", \"virtual_size\"], [\"number\", \"version\"], [\"number\", \"lock_time\"], [\"string\", \"block_hash\"], [\"number\", \"block_number\"], [\"string\", \"block_timestamp\"], [\"string\", \"block_timestamp_month\"], [\"number\", \"input_count\"], [\"number\", \"output_count\"], [\"number\", \"input_value\"], [\"number\", \"output_value\"], [\"string\", \"is_coinbase\"], [\"number\", \"fee\"], [\"string\", \"inputs\"], [\"string\", \"outputs\"]],\n columnOptions: [{\"width\": \"1px\", 
\"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 34 } ] }, { "cell_type": "markdown", "source": [ "Since the `block_timestamp` column contains the date and time of each transaction as a timestamp (the preview above shows values such as `2009-01-12 06:02:13+00:00`), we'll convert these into DATE format using the **DATE()** function.\n", "\n", "We do that in a CTE named `time`; the main query then counts the number of transactions for each date and sorts the table so that earlier dates appear first. " ], "metadata": { "id": "afJXEFTGK1vO" } }, { "cell_type": "code", "source": [ "# Query to select the number of transactions per date, sorted by date\n", "query_with_CTE = \"\"\" \n", " WITH time AS \n", " (\n", " SELECT DATE(block_timestamp) AS trans_date\n", " FROM `bigquery-public-data.crypto_bitcoin.transactions`\n", " )\n", " SELECT COUNT(1) AS transactions,\n", " trans_date\n", " FROM time\n", " GROUP BY trans_date\n", " ORDER BY trans_date\n", " \"\"\"\n", "\n", "# Set up the query (cancel the query if it would use too much of \n", "# your quota, with the limit set to 10 GB)\n", "safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)\n", "query_job = client.query(query_with_CTE, job_config=safe_config)\n", "\n", "# API request - run the query, and convert the results to a pandas DataFrame\n", "transactions_by_date = query_job.to_dataframe()\n", "\n", "# Print the first five rows\n", "transactions_by_date.head()" ], "metadata": { "id": "yvWGKLH2B8Qa", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "4461e353-211d-44a1-945b-07f6317ffb70" }, "execution_count": 35, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " transactions trans_date\n", "0 1 2009-01-03\n", "1 14 2009-01-09\n", "2 61 2009-01-10\n", "3 93 2009-01-11\n", "4 101 2009-01-12" ], "text/html": [ "\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 1,\n 'f': \"1\",\n },\n\"2009-01-03\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 14,\n 'f': \"14\",\n },\n\"2009-01-09\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 61,\n 'f': \"61\",\n },\n\"2009-01-10\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 93,\n 'f': \"93\",\n },\n\"2009-01-11\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 101,\n 'f': \"101\",\n },\n\"2009-01-12\"]],\n columns: [[\"number\", \"index\"], [\"number\", \"transactions\"], [\"string\", \"trans_date\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 35 } ] }, { "cell_type": "markdown", "source": [ "Since they're returned sorted, we can easily plot the raw results to show us the number of Bitcoin transactions per day over the whole timespan of this dataset." ], "metadata": { "id": "gqgnFoJ1LMYf" } }, { "cell_type": "code", "source": [ "transactions_by_date.set_index('trans_date').plot()" ], "metadata": { "id": "qXRnbHIELMzn", "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "outputId": "23c9be13-57b8-4aae-85b2-c0f11474af79" }, "execution_count": 36, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 36 }, { "output_type": "display_data", "data": { "text/plain": [ "
[Figure: line plot of the number of Bitcoin transactions per day against trans_date]" ] }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "As you can see, common table expressions (CTEs) let you shift a lot of your data cleaning into SQL. **That's an especially good thing in the case of BigQuery, because the aggregation runs on BigQuery's servers and only the small summary table is downloaded, which is typically much faster than pulling the raw rows into pandas and aggregating there.**" ], "metadata": { "id": "PZq7HuBsLQye" } }, { "cell_type": "markdown", "source": [ "### Joining data\n", "\n", "When our data lives across different tables, how do we analyze it? By\n", "JOINing the tables together. 
A `JOIN` combines rows in the left table with\n", "corresponding rows in the right table, where the meaning of “corresponding” is based on how we specify the join.\n", "\n", "GitHub is the most popular place to collaborate on software projects. A GitHub **repository** (or **repo**) is a collection of files associated with a specific project. Most repos on GitHub are shared under a specific legal license, which determines the legal restrictions on how they are used. **For our example, we're going to look at how many different files have been released under each license.** \n", "\n", "We'll work with two tables in the database. The first table is the `licenses` table, which provides the name of each GitHub repo (in the `repo_name` column) and its corresponding license. Here's a view of the first five rows." ], "metadata": { "id": "xkwgZ5D_NDae" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"github_repos\" dataset\n", "dataset_ref = client.dataset(\"github_repos\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# Construct a reference to the \"licenses\" table\n", "licenses_ref = dataset_ref.table(\"licenses\")\n", "\n", "# API request - fetch the table\n", "licenses_table = client.get_table(licenses_ref)\n", "\n", "# Preview the first five lines of the \"licenses\" table\n", "client.list_rows(licenses_table, max_results=5).to_dataframe()" ], "metadata": { "id": "TYTZKEmlLUTY", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "868ecbdf-bf33-4690-84b9-2625c33351f1" }, "execution_count": 42, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " repo_name license\n", "0 nbstreet/batteryAce artistic-2.0\n", "1 thecodersguild/wordpress-theming-workshop artistic-2.0\n", "2 hyeon1219e/freezing-octo-dubstep artistic-2.0\n", "3 mfinc/mfinc artistic-2.0\n", "4 gitpan/Map-Tube-NYC artistic-2.0" ], "text/html": [ 
"\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"nbstreet/batteryAce\",\n\"artistic-2.0\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"thecodersguild/wordpress-theming-workshop\",\n\"artistic-2.0\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"hyeon1219e/freezing-octo-dubstep\",\n\"artistic-2.0\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"mfinc/mfinc\",\n\"artistic-2.0\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"gitpan/Map-Tube-NYC\",\n\"artistic-2.0\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"repo_name\"], [\"string\", \"license\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 42 } ] }, { "cell_type": "markdown", "source": [ "The second table is the `sample_files` table, which provides, among other information, the GitHub repo that each file belongs to (in the `repo_name` column). The first several rows of this table are printed below." 
], "metadata": { "id": "1Xxrz8BxNMsF" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"sample_files\" table\n", "files_ref = dataset_ref.table(\"sample_files\")\n", "\n", "# API request - fetch the table\n", "files_table = client.get_table(files_ref)\n", "\n", "# Preview the first five lines of the \"sample_files\" table\n", "client.list_rows(files_table, max_results=5).to_dataframe()" ], "metadata": { "id": "gzyzOy7DNPti", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "b2ec39c1-620e-45c4-cf2c-7dd98ce3c01b" }, "execution_count": 38, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " repo_name ref \\\n", "0 git/git refs/heads/master \n", "1 np/ling refs/heads/master \n", "2 np/ling refs/heads/master \n", "3 np/ling refs/heads/master \n", "4 np/ling refs/heads/master \n", "\n", " path mode \\\n", "0 RelNotes 40960 \n", "1 tests/success/plug_compose.t/plug_compose.ll 40960 \n", "2 fixtures/strict-par-success/parallel_assoc_lef... 40960 \n", "3 fixtures/sequence/parallel_assoc_2tensor2_left.ll 40960 \n", "4 fixtures/success/my_dual.ll 40960 \n", "\n", " id \\\n", "0 62615ffa4e97803da96aefbc798ab50f949a8db7 \n", "1 0c1605e4b447158085656487dc477f7670c4bac1 \n", "2 b59bff84ec03d12fabd3b51a27ed7e39a180097e \n", "3 f29523e3fb65702d99478e429eac6f801f32152b \n", "4 38a3af095088f90dfc956cb990e893909c3ab286 \n", "\n", " symlink_target \n", "0 Documentation/RelNotes/2.10.0.txt \n", "1 ../../../fixtures/all/plug_compose.ll \n", "2 ../all/parallel_assoc_left.ll \n", "3 ../all/parallel_assoc_2tensor2_left.ll \n", "4 ../all/my_dual.ll " ], "text/html": [ "\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"git/git\",\n\"refs/heads/master\",\n\"RelNotes\",\n{\n 'v': 40960,\n 'f': \"40960\",\n },\n\"62615ffa4e97803da96aefbc798ab50f949a8db7\",\n\"Documentation/RelNotes/2.10.0.txt\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"np/ling\",\n\"refs/heads/master\",\n\"tests/success/plug_compose.t/plug_compose.ll\",\n{\n 'v': 40960,\n 'f': \"40960\",\n },\n\"0c1605e4b447158085656487dc477f7670c4bac1\",\n\"../../../fixtures/all/plug_compose.ll\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"np/ling\",\n\"refs/heads/master\",\n\"fixtures/strict-par-success/parallel_assoc_left.ll\",\n{\n 'v': 40960,\n 'f': \"40960\",\n },\n\"b59bff84ec03d12fabd3b51a27ed7e39a180097e\",\n\"../all/parallel_assoc_left.ll\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"np/ling\",\n\"refs/heads/master\",\n\"fixtures/sequence/parallel_assoc_2tensor2_left.ll\",\n{\n 'v': 40960,\n 'f': \"40960\",\n },\n\"f29523e3fb65702d99478e429eac6f801f32152b\",\n\"../all/parallel_assoc_2tensor2_left.ll\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"np/ling\",\n\"refs/heads/master\",\n\"fixtures/success/my_dual.ll\",\n{\n 'v': 40960,\n 'f': \"40960\",\n },\n\"38a3af095088f90dfc956cb990e893909c3ab286\",\n\"../all/my_dual.ll\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"repo_name\"], [\"string\", \"ref\"], [\"string\", \"path\"], [\"number\", \"mode\"], [\"string\", \"id\"], [\"string\", \"symlink_target\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 38 } ] }, { "cell_type": "markdown", "source": [ "Next, we write a query that uses information in both tables to determine 
how many files are released under each license." ], "metadata": { "id": "SIb2OHmnNV0H" } }, { "cell_type": "code", "source": [ "# Query to determine the number of files per license, sorted by number of files\n", "query = \"\"\"\n", " SELECT L.license, COUNT(1) AS number_of_files\n", " FROM `bigquery-public-data.github_repos.sample_files` AS sf\n", " INNER JOIN `bigquery-public-data.github_repos.licenses` AS L \n", " ON sf.repo_name = L.repo_name\n", " GROUP BY L.license\n", " ORDER BY number_of_files DESC\n", " \"\"\"\n", "\n", "# Set up the query (cancel the query if it would use too much of \n", "# your quota, with the limit set to 10 GB)\n", "safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)\n", "query_job = client.query(query, job_config=safe_config)\n", "\n", "# API request - run the query, and convert the results to a pandas DataFrame\n", "file_count_by_license = query_job.to_dataframe()" ], "metadata": { "id": "3WWRJmdWNYR4" }, "execution_count": 39, "outputs": [] }, { "cell_type": "markdown", "source": [ "It's a big query, so we'll investigate each piece separately.\n", "\n", "![](https://i.imgur.com/QeufD01.png)\n", " \n", "We'll begin with the **JOIN** (highlighted in blue above). This specifies the sources of data and how to join them. We use **ON** to specify that we combine the tables by matching the values in the `repo_name` column of both tables.\n", "\n", "Next, we'll talk about **SELECT** and **GROUP BY** (highlighted in yellow). The **GROUP BY** breaks the data into a different group for each license, before we **COUNT** the number of rows in the `sample_files` table that correspond to each license. (Remember that you can count the number of rows with `COUNT(1)`.) 
\n", "\n", "Finally, the **ORDER BY** (highlighted in purple) sorts the results so that licenses with more files appear first.\n", "\n", "It was a big query, but it gave us a nice table summarizing how many files have been committed under each license: " ], "metadata": { "id": "WA0TW4FpNdQ9" } }, { "cell_type": "code", "source": [ "# Print the DataFrame\n", "file_count_by_license" ], "metadata": { "id": "4hGOBl6rNfve", "colab": { "base_uri": "https://localhost:8080/", "height": 413 }, "outputId": "cbb2b05f-1e9a-4624-d877-7d00226d9053" }, "execution_count": 40, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " license number_of_files\n", "0 mit 20408848\n", "1 gpl-2.0 16440828\n", "2 apache-2.0 7114054\n", "3 gpl-3.0 4840103\n", "4 bsd-3-clause 3149733\n", "5 agpl-3.0 1321015\n", "6 lgpl-2.1 775792\n", "7 bsd-2-clause 687381\n", "8 lgpl-3.0 569941\n", "9 mpl-2.0 458331\n", "10 cc0-1.0 406823\n", "11 epl-1.0 312269\n", "12 unlicense 208494\n", "13 artistic-2.0 147904\n", "14 isc 118063" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
licensenumber_of_files
0mit20408848
1gpl-2.016440828
2apache-2.07114054
3gpl-3.04840103
4bsd-3-clause3149733
5agpl-3.01321015
6lgpl-2.1775792
7bsd-2-clause687381
8lgpl-3.0569941
9mpl-2.0458331
10cc0-1.0406823
11epl-1.0312269
12unlicense208494
13artistic-2.0147904
14isc118063
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"mit\",\n{\n 'v': 20408848,\n 'f': \"20408848\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"gpl-2.0\",\n{\n 'v': 16440828,\n 'f': \"16440828\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"apache-2.0\",\n{\n 'v': 7114054,\n 'f': \"7114054\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"gpl-3.0\",\n{\n 'v': 4840103,\n 'f': \"4840103\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"bsd-3-clause\",\n{\n 'v': 3149733,\n 'f': \"3149733\",\n }],\n [{\n 'v': 5,\n 'f': \"5\",\n },\n\"agpl-3.0\",\n{\n 'v': 1321015,\n 'f': \"1321015\",\n }],\n [{\n 'v': 6,\n 'f': \"6\",\n },\n\"lgpl-2.1\",\n{\n 'v': 775792,\n 'f': \"775792\",\n }],\n [{\n 'v': 7,\n 'f': \"7\",\n },\n\"bsd-2-clause\",\n{\n 'v': 687381,\n 'f': \"687381\",\n }],\n [{\n 'v': 8,\n 'f': \"8\",\n },\n\"lgpl-3.0\",\n{\n 'v': 569941,\n 'f': \"569941\",\n }],\n [{\n 'v': 9,\n 'f': \"9\",\n },\n\"mpl-2.0\",\n{\n 'v': 458331,\n 'f': \"458331\",\n }],\n [{\n 'v': 10,\n 'f': \"10\",\n },\n\"cc0-1.0\",\n{\n 'v': 406823,\n 'f': \"406823\",\n }],\n [{\n 'v': 11,\n 'f': \"11\",\n },\n\"epl-1.0\",\n{\n 'v': 312269,\n 'f': \"312269\",\n }],\n [{\n 'v': 12,\n 'f': \"12\",\n },\n\"unlicense\",\n{\n 'v': 208494,\n 'f': \"208494\",\n }],\n [{\n 'v': 13,\n 'f': \"13\",\n },\n\"artistic-2.0\",\n{\n 'v': 147904,\n 'f': \"147904\",\n }],\n [{\n 'v': 14,\n 'f': \"14\",\n },\n\"isc\",\n{\n 'v': 118063,\n 'f': \"118063\",\n }]],\n columns: [[\"number\", \"index\"], [\"string\", \"license\"], [\"number\", \"number_of_files\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 40 } ] }, { 
"cell_type": "markdown", "source": [ "There are a few more types of JOIN, along with how to use UNIONs to pull information from multiple tables. We'll work with the [Hacker News](https://www.kaggle.com/hacker-news/hacker-news) dataset. We begin by reviewing the first several rows of the `comments` table." ], "metadata": { "id": "bSlj8eo5PZ85" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"hacker_news\" dataset\n", "dataset_ref = client.dataset(\"hacker_news\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# Construct a reference to the \"comments\" table\n", "table_ref = dataset_ref.table(\"comments\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "ZC-UGEr3Paq0", "colab": { "base_uri": "https://localhost:8080/", "height": 346 }, "outputId": "b36c742d-ad93-496e-abb2-5bec82dfe7b0" }, "execution_count": 44, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " id by author time time_ts \\\n", "0 2701393 5l 5l 1309184881 2011-06-27 14:28:01+00:00 \n", "1 5811403 99 99 1370234048 2013-06-03 04:34:08+00:00 \n", "2 21623 AF AF 1178992400 2007-05-12 17:53:20+00:00 \n", "3 10159727 EA EA 1441206574 2015-09-02 15:09:34+00:00 \n", "4 2988424 Iv Iv 1315853580 2011-09-12 18:53:00+00:00 \n", "\n", " text parent deleted dead \\\n", "0 And the glazier who fixed all the broken windo... 2701243 None None \n", "1 Does canada have the equivalent of H1B/Green c... 5804452 None None \n", "2 Speaking of Rails, there are other options in ... 21611 None None \n", "3 Humans and large livestock (and maybe even pet... 10159396 None None \n", "4 I must say I reacted in the same way when I re... 
2988179 None None \n", "\n", " ranking \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idbyauthortimetime_tstextparentdeleteddeadranking
027013935l5l13091848812011-06-27 14:28:01+00:00And the glazier who fixed all the broken windo...2701243NoneNone0
15811403999913702340482013-06-03 04:34:08+00:00Does canada have the equivalent of H1B/Green c...5804452NoneNone0
221623AFAF11789924002007-05-12 17:53:20+00:00Speaking of Rails, there are other options in ...21611NoneNone0
310159727EAEA14412065742015-09-02 15:09:34+00:00Humans and large livestock (and maybe even pet...10159396NoneNone0
42988424IvIv13158535802011-09-12 18:53:00+00:00I must say I reacted in the same way when I re...2988179NoneNone0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 2701393,\n 'f': \"2701393\",\n },\n\"5l\",\n\"5l\",\n{\n 'v': 1309184881,\n 'f': \"1309184881\",\n },\n\"2011-06-27 14:28:01+00:00\",\n\"And the glazier who fixed all the broken windows also left his money to good causes.\",\n{\n 'v': 2701243,\n 'f': \"2701243\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 5811403,\n 'f': \"5811403\",\n },\n\"99\",\n\"99\",\n{\n 'v': 1370234048,\n 'f': \"1370234048\",\n },\n\"2013-06-03 04:34:08+00:00\",\n\"Does canada have the equivalent of H1B/Green card for work sponsorship? What do you think of that?\",\n{\n 'v': 5804452,\n 'f': \"5804452\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 21623,\n 'f': \"21623\",\n },\n\"AF\",\n\"AF\",\n{\n 'v': 1178992400,\n 'f': \"1178992400\",\n },\n\"2007-05-12 17:53:20+00:00\",\n\"Speaking of Rails, there are other options in the Python world besides Django.

Pylons is a very Rails-y framework with the difference being that it is made to be easy to customize. In Rails if you don't like something you are going to have a hard time changing it out unless you are a good hacker. In Pylons that is easy, and you've got access to Python's vastly better platform (speed, Unicode support) and libraries.

If you are an absolute beginning programmer it might be kind of hard to pick up, but if you've programmed a bit or you've used one or two web frameworks (especially Rails) Pylons won't be hard to learn.

http://pylonshq.com/<\\/a>\",\n{\n 'v': 21611,\n 'f': \"21611\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 10159727,\n 'f': \"10159727\",\n },\n\"EA\",\n\"EA\",\n{\n 'v': 1441206574,\n 'f': \"1441206574\",\n },\n\"2015-09-02 15:09:34+00:00\",\n\"Humans and large livestock (and maybe even pets) will have health monitoring devices embedded into their bodies in the near future. The devices will save the insurance companies money. Savings on insurance premiums will be the incentive to encourage mass adoption by citizens and owners of livestock.\",\n{\n 'v': 10159396,\n 'f': \"10159396\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 2988424,\n 'f': \"2988424\",\n },\n\"Iv\",\n\"Iv\",\n{\n 'v': 1315853580,\n 'f': \"1315853580\",\n },\n\"2011-09-12 18:53:00+00:00\",\n\"I must say I reacted in the same way when I read about Madoff. 
The fact that people who are supposed to inspect investments would fall for such a scheme was one of the first nails that was put in the esteem I had for economy specialists.\",\n{\n 'v': 2988179,\n 'f': \"2988179\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': null,\n 'f': \"null\",\n },\n{\n 'v': 0,\n 'f': \"0\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"id\"], [\"string\", \"by\"], [\"string\", \"author\"], [\"number\", \"time\"], [\"string\", \"time_ts\"], [\"string\", \"text\"], [\"number\", \"parent\"], [\"number\", \"deleted\"], [\"number\", \"dead\"], [\"number\", \"ranking\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 44 } ] }, { "cell_type": "code", "source": [ "# Construct a reference to the \"stories\" table\n", "table_ref = dataset_ref.table(\"stories\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "GG9M1WJYPgrg", "colab": { "base_uri": "https://localhost:8080/", "height": 495 }, "outputId": "8ecf0e6e-ee3c-43fd-a92d-614da5ef5dd5" }, "execution_count": 45, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " id by score time time_ts \\\n", "0 6940813 sarath237 0 1387536270 2013-12-20 10:44:30+00:00 \n", "1 6991401 123123321321 0 1388508751 2013-12-31 16:52:31+00:00 \n", "2 1531556 ssn 0 1279617234 2010-07-20 09:13:54+00:00 \n", "3 5012398 hoju 0 1357387877 2013-01-05 12:11:17+00:00 \n", "4 7214182 kogir 0 1401561740 2014-05-31 18:42:20+00:00 \n", "\n", " title \\\n", "0 Sheryl Brindo Hot Pics \n", "1 Are you people also put off by the culture of ... 
\n", "2 New UI for Google Image Search \n", "3 Historic website screenshots \n", "4 Placeholder \n", "\n", " url \\\n", "0 http://www.youtube.com/watch?v=ym1cyxneB0Y \n", "1 \n", "2 http://googlesystem.blogspot.com/2010/07/googl... \n", "3 http://webscraping.com/blog/Generate-website-s... \n", "4 \n", "\n", " text deleted dead \\\n", "0 Sheryl Brindo Hot Pics None True \n", "1 They're pretty explicitly 'startup f... None True \n", "2 Again following on Bing's lead. None None \n", "3 Python script to generate historic screenshots... None None \n", "4 Mind the gap. None None \n", "\n", " descendants author \n", "0 NaN sarath237 \n", "1 NaN 123123321321 \n", "2 0.0 ssn \n", "3 0.0 hoju \n", "4 0.0 kogir " ], "text/html": [ "\n", "

\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idbyscoretimetime_tstitleurltextdeleteddeaddescendantsauthor
06940813sarath237013875362702013-12-20 10:44:30+00:00Sheryl Brindo Hot Picshttp://www.youtube.com/watch?v=ym1cyxneB0YSheryl Brindo Hot PicsNoneTrueNaNsarath237
16991401123123321321013885087512013-12-31 16:52:31+00:00Are you people also put off by the culture of ...They&#x27;re pretty explicitly &#x27;startup f...NoneTrueNaN123123321321
21531556ssn012796172342010-07-20 09:13:54+00:00New UI for Google Image Searchhttp://googlesystem.blogspot.com/2010/07/googl...Again following on Bing's lead.NoneNone0.0ssn
35012398hoju013573878772013-01-05 12:11:17+00:00Historic website screenshotshttp://webscraping.com/blog/Generate-website-s...Python script to generate historic screenshots...NoneNone0.0hoju
47214182kogir014015617402014-05-31 18:42:20+00:00PlaceholderMind the gap.NoneNone0.0kogir
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 6940813,\n 'f': \"6940813\",\n },\n\"sarath237\",\n{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 1387536270,\n 'f': \"1387536270\",\n },\n\"2013-12-20 10:44:30+00:00\",\n\" Sheryl Brindo Hot Pics \",\n\"http://www.youtube.com/watch?v=ym1cyxneB0Y\",\n\" Sheryl Brindo Hot Pics\",\n{\n 'v': null,\n 'f': \"null\",\n },\ntrue,\n{\n 'v': NaN,\n 'f': \"NaN\",\n },\n\"sarath237\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 6991401,\n 'f': \"6991401\",\n },\n\"123123321321\",\n{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 1388508751,\n 'f': \"1388508751\",\n },\n\"2013-12-31 16:52:31+00:00\",\n\"Are you people also put off by the culture of startup incubators?\",\n\"\",\n\"They're pretty explicitly 'startup factories' where the already-wealthy can capitalize on up-and-coming products and services. They take something that's appealing to people because of the freedom and wealth it provides and then turn it into a way to capitalize on the labour and ideas of others (even if the ideas themselves aren't necessarily that special).

Then on top of that, people have to put up an act to fit in to the culture, making it basically like work.

Ultimately they're a very useful service and all, but they are also what I just described. Is it something we just have to live with? What are your thoughts?\",\n{\n 'v': null,\n 'f': \"null\",\n },\ntrue,\n{\n 'v': NaN,\n 'f': \"NaN\",\n },\n\"123123321321\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 1531556,\n 'f': \"1531556\",\n },\n\"ssn\",\n{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 1279617234,\n 'f': \"1279617234\",\n },\n\"2010-07-20 09:13:54+00:00\",\n\"New UI for Google Image Search\",\n\"http://googlesystem.blogspot.com/2010/07/google-tests-new-image-search-interface.html\",\n\"Again following on Bing's lead.\",\n{\n 'v': null,\n 'f': \"null\",\n },\nnull,\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"ssn\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 5012398,\n 'f': \"5012398\",\n },\n\"hoju\",\n{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 1357387877,\n 'f': \"1357387877\",\n },\n\"2013-01-05 12:11:17+00:00\",\n\"Historic website screenshots\",\n\"http://webscraping.com/blog/Generate-website-screenshot-history/\",\n\"Python script to generate historic screenshots of a website.\",\n{\n 'v': null,\n 'f': \"null\",\n },\nnull,\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"hoju\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 7214182,\n 'f': \"7214182\",\n },\n\"kogir\",\n{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 1401561740,\n 'f': \"1401561740\",\n },\n\"2014-05-31 18:42:20+00:00\",\n\"Placeholder\",\n\"\",\n\"Mind the gap.\",\n{\n 'v': null,\n 'f': \"null\",\n },\nnull,\n{\n 'v': 0.0,\n 'f': \"0.0\",\n },\n\"kogir\"]],\n columns: [[\"number\", \"index\"], [\"number\", \"id\"], [\"string\", \"by\"], [\"number\", \"score\"], [\"number\", \"time\"], [\"string\", \"time_ts\"], [\"string\", \"title\"], [\"string\", \"url\"], [\"string\", \"text\"], [\"number\", \"deleted\"], [\"string\", \"dead\"], [\"number\", \"descendants\"], [\"string\", \"author\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: 
\"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 45 } ] }, { "cell_type": "markdown", "source": [ "The query below pulls information from the `stories` and `comments` tables to create a table showing all stories posted on January 1, 2012, along with the corresponding number of comments. We use a **LEFT JOIN** so that the results include stories that didn't receive any comments." ], "metadata": { "id": "w-36FLhDPd2K" } }, { "cell_type": "code", "source": [ "# Query to select all stories posted on January 1, 2012, with number of comments\n", "join_query = \"\"\"\n", " WITH c AS\n", " (\n", " SELECT parent, COUNT(*) as num_comments\n", " FROM `bigquery-public-data.hacker_news.comments` \n", " GROUP BY parent\n", " )\n", " SELECT s.id as story_id, s.by, s.title, c.num_comments\n", " FROM `bigquery-public-data.hacker_news.stories` AS s\n", " LEFT JOIN c\n", " ON s.id = c.parent\n", " WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'\n", " ORDER BY c.num_comments DESC\n", " \"\"\"\n", "\n", "# Run the query, and return a pandas DataFrame\n", "join_result = client.query(join_query).result().to_dataframe()\n", "join_result.head()" ], "metadata": { "id": "WU9bl9oYPlUj", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "f953b5e3-cb52-4565-9569-0726252c88f6" }, "execution_count": 46, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " story_id by title \\\n", "0 3412900 whoishiring Ask HN: Who is Hiring? (January 2012) \n", "1 3412901 whoishiring Ask HN: Freelancer? Seeking freelancer? (Janua... \n", "2 3412643 jemeshsu Avoid Apress \n", "3 3414012 ramanujam Impress.js - a Prezi like implementation using... \n", "4 3412891 Brajeshwar There's no shame in code that is simply \"good ... \n", "\n", " num_comments \n", "0 154.0 \n", "1 97.0 \n", "2 30.0 \n", "3 27.0 \n", "4 27.0 " ], "text/html": [ "\n", "

\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
story_idbytitlenum_comments
03412900whoishiringAsk HN: Who is Hiring? (January 2012)154.0
13412901whoishiringAsk HN: Freelancer? Seeking freelancer? (Janua...97.0
23412643jemeshsuAvoid Apress30.0
33414012ramanujamImpress.js - a Prezi like implementation using...27.0
43412891BrajeshwarThere's no shame in code that is simply \"good ...27.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n{\n 'v': 3412900,\n 'f': \"3412900\",\n },\n\"whoishiring\",\n\"Ask HN: Who is Hiring? (January 2012)\",\n{\n 'v': 154.0,\n 'f': \"154.0\",\n }],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n{\n 'v': 3412901,\n 'f': \"3412901\",\n },\n\"whoishiring\",\n\"Ask HN: Freelancer? Seeking freelancer? (January 2012)\",\n{\n 'v': 97.0,\n 'f': \"97.0\",\n }],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n{\n 'v': 3412643,\n 'f': \"3412643\",\n },\n\"jemeshsu\",\n\"Avoid Apress\",\n{\n 'v': 30.0,\n 'f': \"30.0\",\n }],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n{\n 'v': 3414012,\n 'f': \"3414012\",\n },\n\"ramanujam\",\n\"Impress.js - a Prezi like implementation using CSS3 3D transformations\",\n{\n 'v': 27.0,\n 'f': \"27.0\",\n }],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n{\n 'v': 3412891,\n 'f': \"3412891\",\n },\n\"Brajeshwar\",\n\"There's no shame in code that is simply \\\"good enough\\\"\",\n{\n 'v': 27.0,\n 'f': \"27.0\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"story_id\"], [\"string\", \"by\"], [\"string\", \"title\"], [\"number\", \"num_comments\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 46 } ] }, { "cell_type": "markdown", "source": [ "Since the results are ordered by the `num_comments` column, stories without comments appear at the end of the DataFrame. 
(Remember that **NaN** stands for \"not a number\"; it appears here because the **LEFT JOIN** finds no matching row in `c` for these stories, so `num_comments` is missing.)" ], "metadata": { "id": "s3F7dqryPnua" } }, { "cell_type": "code", "source": [ "# None of these stories received any comments\n", "join_result.tail()" ], "metadata": { "id": "EbbScrGpPqTy", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "31d2e733-a524-4c1c-d96e-5fb5ff7adb43" }, "execution_count": 47, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " story_id by \\\n", "439 3412721 kooljp \n", "440 3413606 willvarfar \n", "441 3413159 see_cloudtweaks \n", "442 3412972 abionic \n", "443 3412388 deviceguru \n", "\n", " title num_comments \n", "439 Carolina Panthers vs New Orleans Saints Live S... NaN \n", "440 Poll: what is your (Lipson-Shiu) personality t... NaN \n", "441 IBM Cloud Computing: Overview - United States NaN \n", "442 Is SPLUNK eating up your disk space, might be ... NaN \n", "443 Google TV 2.0 screenshot tour NaN " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
story_idbytitlenum_comments
4393412721kooljpCarolina Panthers vs New Orleans Saints Live S...NaN
4403413606willvarfarPoll: what is your (Lipson-Shiu) personality t...NaN
4413413159see_cloudtweaksIBM Cloud Computing: Overview - United StatesNaN
4423412972abionicIs SPLUNK eating up your disk space, might be ...NaN
4433412388deviceguruGoogle TV 2.0 screenshot tourNaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 439,\n 'f': \"439\",\n },\n{\n 'v': 3412721,\n 'f': \"3412721\",\n },\n\"kooljp\",\n\"Carolina Panthers vs New Orleans Saints Live Stream NFL \",\n{\n 'v': NaN,\n 'f': \"NaN\",\n }],\n [{\n 'v': 440,\n 'f': \"440\",\n },\n{\n 'v': 3413606,\n 'f': \"3413606\",\n },\n\"willvarfar\",\n\"Poll: what is your (Lipson-Shiu) personality type?\",\n{\n 'v': NaN,\n 'f': \"NaN\",\n }],\n [{\n 'v': 441,\n 'f': \"441\",\n },\n{\n 'v': 3413159,\n 'f': \"3413159\",\n },\n\"see_cloudtweaks\",\n\"IBM Cloud Computing: Overview - United States\",\n{\n 'v': NaN,\n 'f': \"NaN\",\n }],\n [{\n 'v': 442,\n 'f': \"442\",\n },\n{\n 'v': 3412972,\n 'f': \"3412972\",\n },\n\"abionic\",\n\"Is SPLUNK eating up your disk space, might be index size\",\n{\n 'v': NaN,\n 'f': \"NaN\",\n }],\n [{\n 'v': 443,\n 'f': \"443\",\n },\n{\n 'v': 3412388,\n 'f': \"3412388\",\n },\n\"deviceguru\",\n\"Google TV 2.0 screenshot tour\",\n{\n 'v': NaN,\n 'f': \"NaN\",\n }]],\n columns: [[\"number\", \"index\"], [\"number\", \"story_id\"], [\"string\", \"by\"], [\"string\", \"title\"], [\"number\", \"num_comments\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 47 } ] }, { "cell_type": "markdown", "source": [ "As you've seen, JOINs horizontally combine results from different tables. If you instead would like to vertically concatenate columns, you can do so with a `UNION`. \n", "\n", "Next, we write a query to select all usernames corresponding to users who wrote stories or comments on January 1, 2014. 
We use **UNION DISTINCT** (instead of **UNION ALL**) to ensure that each user appears in the table at most once." ], "metadata": { "id": "nZ65-rWPPsPf" } }, { "cell_type": "code", "source": [ "# Query to select all users who posted stories or comments on January 1, 2014\n", "union_query = \"\"\"\n", " SELECT c.by\n", " FROM `bigquery-public-data.hacker_news.comments` AS c\n", " WHERE EXTRACT(DATE FROM c.time_ts) = '2014-01-01'\n", " UNION DISTINCT\n", " SELECT s.by\n", " FROM `bigquery-public-data.hacker_news.stories` AS s\n", " WHERE EXTRACT(DATE FROM s.time_ts) = '2014-01-01'\n", " \"\"\"\n", "\n", "# Run the query, and return a pandas DataFrame\n", "union_result = client.query(union_query).result().to_dataframe()\n", "union_result.head()" ], "metadata": { "id": "W7SueEUcPt5P", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "14eb4514-614e-4b88-b60e-f200412fe994" }, "execution_count": 48, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " by\n", "0 kawsper\n", "1 mayrund\n", "2 webmaven\n", "3 kmfrk\n", "4 rbobby" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
by
0kawsper
1mayrund
2webmaven
3kmfrk
4rbobby
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/f872b2c2305463fd/data_table.js\";\n\n window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"kawsper\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"mayrund\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"webmaven\"],\n [{\n 'v': 3,\n 'f': \"3\",\n },\n\"kmfrk\"],\n [{\n 'v': 4,\n 'f': \"4\",\n },\n\"rbobby\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"by\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n " }, "metadata": {}, "execution_count": 48 } ] }, { "cell_type": "markdown", "source": [ "To get the number of users who posted on January 1, 2014, we need only take the length of the DataFrame." ], "metadata": { "id": "pHWr22nyPwL4" } }, { "cell_type": "code", "source": [ "# Number of users who posted stories or comments on January 1, 2014\n", "len(union_result)" ], "metadata": { "id": "fXlfo6uOPyWw", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "8bc18318-2b55-4e8d-d205-e5b4288df199" }, "execution_count": 49, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2282" ] }, "metadata": {}, "execution_count": 49 } ] }, { "cell_type": "markdown", "source": [ "### Analytic Function\n", "\n", "You can also define analytic functions, which, like aggregate functions, operate on a set of rows. However, unlike aggregate functions, analytic functions return a (potentially different) value for each row in the original table. Analytic functions allow us to perform complex calculations with relatively straightforward syntax. 
For instance, we can quickly calculate moving averages and running totals, among other quantities.\n", "\n", "We'll work with the [San Francisco Open Data](https://www.kaggle.com/datasf/san-francisco) dataset." ], "metadata": { "id": "pWi_mgaHRr4d" } }, { "cell_type": "code", "source": [ "# Construct a reference to the \"san_francisco\" dataset\n", "dataset_ref = client.dataset(\"san_francisco\", project=\"bigquery-public-data\")\n", "\n", "# API request - fetch the dataset\n", "dataset = client.get_dataset(dataset_ref)\n", "\n", "# Construct a reference to the \"bikeshare_trips\" table\n", "table_ref = dataset_ref.table(\"bikeshare_trips\")\n", "\n", "# API request - fetch the table\n", "table = client.get_table(table_ref)\n", "\n", "# Preview the first five lines of the table\n", "client.list_rows(table, max_results=5).to_dataframe()" ], "metadata": { "id": "toMZH_UbRvHH", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "8ed30442-ab7f-45cd-d630-f27bc1969a47" }, "execution_count": 50, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " trip_id duration_sec start_date start_station_name \\\n", "0 944732 2618 2015-09-24 17:22:00+00:00 Mezes \n", "1 984595 5957 2015-10-25 18:12:00+00:00 Mezes \n", "2 984596 5913 2015-10-25 18:13:00+00:00 Mezes \n", "3 1129385 6079 2016-03-18 10:33:00+00:00 Mezes \n", "4 1030383 5780 2015-12-06 10:52:00+00:00 Mezes \n", "\n", " start_station_id end_date end_station_name \\\n", "0 83 2015-09-24 18:06:00+00:00 Mezes \n", "1 83 2015-10-25 19:51:00+00:00 Mezes \n", "2 83 2015-10-25 19:51:00+00:00 Mezes \n", "3 83 2016-03-18 12:14:00+00:00 Mezes \n", "4 83 2015-12-06 12:28:00+00:00 Mezes \n", "\n", " end_station_id bike_number zip_code subscriber_type \n", "0 83 653 94063 Customer \n", "1 83 52 nil Customer \n", "2 83 121 nil Customer \n", "3 83 208 94070 Customer \n", "4 83 44 94064 Customer " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " " ] }, "metadata": {}, "execution_count": 50 } ] }, { "cell_type": "markdown", "source": [ "Each row of the table corresponds to a different bike trip, and we can use an analytic function to **calculate the cumulative number of trips for each date in 2015.**" ], "metadata": { "id": "nY8VBY-8Rx8w" } }, { "cell_type": "code", "source": [ "# Query to count the (cumulative) number of trips per day\n", "num_trips_query = \"\"\"\n", " WITH trips_by_day AS\n", " (\n", " SELECT DATE(start_date) AS trip_date,\n", " COUNT(*) as num_trips\n", " FROM `bigquery-public-data.san_francisco.bikeshare_trips`\n", " WHERE EXTRACT(YEAR FROM start_date) = 2015\n", " GROUP BY trip_date\n", " )\n", " SELECT *,\n", " SUM(num_trips) \n", " OVER (\n", " ORDER BY trip_date\n", " ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW\n", " ) AS cumulative_trips\n", " FROM trips_by_day\n", " \"\"\"\n", "\n", "# Run the query, and return a pandas DataFrame\n", "num_trips_result = client.query(num_trips_query).result().to_dataframe()\n", "num_trips_result.head()" ], "metadata": { "id": "_aHYi_0IR0WG", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "af0addb0-eb5f-4e0d-fe4b-49853affde91" }, "execution_count": 51, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " trip_date num_trips cumulative_trips\n", "0 2015-01-01 181 181\n", "1 2015-01-02 428 609\n", "2 2015-01-03 283 892\n", "3 2015-01-04 206 1098\n", "4 2015-01-05 1186 2284" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " " ] }, "metadata": {}, "execution_count": 51 } ] }, { "cell_type": "markdown", "source": [ "The query uses a [common table expression (CTE)](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with_clause) to first calculate the daily number of trips. Then, we use **SUM()** with an **OVER** clause, which makes this aggregate function behave as an analytic function.\n", "- Since there is no **PARTITION BY** clause, the entire table is treated as a single partition.\n", "- The **ORDER BY** clause orders the rows by date, where earlier dates appear first. \n", "- By setting the **window frame** clause to `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, we ensure that all rows up to and including the current date are used to calculate the (cumulative) sum. See https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts#def_window_frame for more details.\n", "\n", "The next query **tracks the stations where each bike began (in `start_station_id`) and ended (in `end_station_id`) the day on October 25, 2015.**" ], "metadata": { "id": "ShH0iValR196" } }, { "cell_type": "code", "source": [ "# Query to track beginning and ending stations on October 25, 2015, for each bike\n", "start_end_query = \"\"\"\n", " SELECT bike_number,\n", " TIME(start_date) AS trip_time,\n", " FIRST_VALUE(start_station_id)\n", " OVER (\n", " PARTITION BY bike_number\n", " ORDER BY start_date\n", " ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING\n", " ) AS first_station_id,\n", " LAST_VALUE(end_station_id)\n", " OVER (\n", " PARTITION BY bike_number\n", " ORDER BY start_date\n", " ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING\n", " ) AS last_station_id,\n", " start_station_id,\n", " end_station_id\n", " FROM `bigquery-public-data.san_francisco.bikeshare_trips`\n", " WHERE DATE(start_date) = '2015-10-25' \n", " \"\"\"\n", "\n", "# Run the query, and return a pandas DataFrame\n", "start_end_result = client.query(start_end_query).result().to_dataframe()\n", "start_end_result.head()" ], "metadata": { "id": "ivG9NWgZR7QY", "colab": { "base_uri": "https://localhost:8080/", "height": 197 }, "outputId": "9428ba76-003e-4e15-b2f4-589e5c23782f" }, "execution_count": 52, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " bike_number trip_time first_station_id last_station_id start_station_id \\\n", "0 25 11:43:00 77 51 77 \n", "1 25 12:14:00 77 51 60 \n", "2 111 14:41:00 69 65 69 \n", "3 403 16:54:00 51 54 51 \n", "4 301 13:36:00 35 34 35 \n", "\n", " end_station_id \n", "0 60 \n", "1 51 \n", "2 65 \n", "3 54 \n", "4 35 " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " " ] }, "metadata": {}, "execution_count": 52 } ] }, { "cell_type": "markdown", "source": [ "The query uses both **FIRST_VALUE()** and **LAST_VALUE()** as analytic functions.\n", "- The **PARTITION BY** clause breaks 
the data into partitions based on the `bike_number` column. Since this column holds unique identifiers for the bikes, this ensures the calculations are performed separately for each bike.\n", "- The **ORDER BY** clause puts the rows within each partition in chronological order.\n", "- Since the **window frame** clause is `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, for each row, its entire partition is used to perform the calculation. (_This ensures the calculated values for rows in the same partition are identical._)" ], "metadata": { "id": "iegKJa1nR_rZ" } }, { "cell_type": "markdown", "source": [ "You can check https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction and https://googleapis.dev/python/bigquery/latest/index.html for more details." ], "metadata": { "id": "4XUIvA2v2A30" } }, { "cell_type": "markdown", "source": [ "## Data Wrangling with Pandas" ], "metadata": { "id": "Sd-M02wrbNle" } }, { "cell_type": "markdown", "metadata": { "id": "LU-ma6F6bdkB" }, "source": [ "### `Series` objects\n", "The `pandas` library contains these useful data structures:\n", "* `Series` objects, which we will discuss now. A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).\n", "* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels)." ] }, { "cell_type": "markdown", "metadata": { "id": "q6En2jWCbdkC" }, "source": [ "#### Creating a `Series`\n", "Let's start by creating our first `Series` object!" 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "Twbix6NpbdkC", "outputId": "5c5ec161-1d04-4701-d00f-099eacdfc3cb", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 2\n", "1 -1\n", "2 3\n", "3 5\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 2 } ], "source": [ "s = pd.Series([2,-1,3,5])\n", "s" ] }, { "cell_type": "markdown", "metadata": { "id": "6ERNyjvMbdkE" }, "source": [ "Arithmetic operations on `Series` are also possible, and they apply *elementwise*, just like for `ndarray`s:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "qCPOAd6tbdkF", "outputId": "592e0e3f-0207-42e9-87f8-b62b22aea267", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 1002\n", "1 1999\n", "2 3003\n", "3 4005\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 3 } ], "source": [ "s + [1000,2000,3000,4000]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "WOQxfgJEbdkF", "outputId": "92e8e5e8-5e42-4b5f-dbe1-bace97c1a97a", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 1002\n", "1 999\n", "2 1003\n", "3 1005\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 4 } ], "source": [ "s + 1000" ] }, { "cell_type": "markdown", "metadata": { "id": "05v47g-TbdkG" }, "source": [ "#### Index labels\n", "Each item in a `Series` object has a unique identifier called the *index label*. 
By default, it is simply the position of the item in the `Series` (starting at `0`), but you can also set the index labels manually:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "uToGC_H1bdkG", "outputId": "e7592d19-7aa9-4534-a111-b82f6a6c08a1", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice 68\n", "bob 83\n", "charles 112\n", "darwin 68\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 5 } ], "source": [ "s2 = pd.Series([68, 83, 112, 68], index=[\"alice\", \"bob\", \"charles\", \"darwin\"])\n", "s2" ] }, { "cell_type": "markdown", "metadata": { "id": "YHll54B1bdkH" }, "source": [ "You can then use the `Series` just like a `dict`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "k_5q4EuqbdkH", "outputId": "cbf7a319-2c09-4b4a-b011-8d658aa4900c", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "83" ] }, "metadata": {}, "execution_count": 6 } ], "source": [ "s2[\"bob\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "mmMlzPbLbdkH" }, "source": [ "You can still access the items by integer location, like in a regular array (recent pandas versions deprecate this positional fallback on a label-indexed `Series`, so prefer `iloc`):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "xhH_OoLQbdkI", "outputId": "6b629478-22e5-48e2-c7f5-6e85a1a40cd3", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "83" ] }, "metadata": {}, "execution_count": 7 } ], "source": [ "s2[1]" ] }, { "cell_type": "markdown", "metadata": { "id": "TFDcqmL2bdkI" }, "source": [ "To make it clear which kind of access you mean, it is recommended to always use the `loc` attribute when accessing by label, and the `iloc` attribute when accessing by integer location:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "86J9jGtfbdkI", "outputId": 
"a26d3b9f-2b19-47aa-9bf8-8e4f13a48ab7", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "83" ] }, "metadata": {}, "execution_count": 8 } ], "source": [ "s2.loc[\"bob\"]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "TMjzFBctbdkI", "outputId": "4e049de9-0899-4fef-987b-8677f4af8f50", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "83" ] }, "metadata": {}, "execution_count": 9 } ], "source": [ "s2.iloc[1]" ] }, { "cell_type": "markdown", "metadata": { "id": "tYYk0kN2bdkJ" }, "source": [ "Slicing a `Series` also slices the index labels:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "pGJh6BLRbdkK", "outputId": "38fc2d8a-f54e-49b3-8636-13d44b1ec23c", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "bob 83\n", "charles 112\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 10 } ], "source": [ "s2.iloc[1:3]" ] }, { "cell_type": "markdown", "metadata": { "id": "hzw9bWbBbdkM" }, "source": [ "#### Init from `dict`\n", "You can create a `Series` object from a `dict`. 
The keys will be used as index labels:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "u-QkmNSSbdkM", "outputId": "2a88806a-f222-405e-8ffe-249fbfebe023", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice 68\n", "bob 83\n", "colin 86\n", "darwin 68\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 11 } ], "source": [ "weights = {\"alice\": 68, \"bob\": 83, \"colin\": 86, \"darwin\": 68}\n", "s3 = pd.Series(weights)\n", "s3" ] }, { "cell_type": "markdown", "metadata": { "id": "SM3qtn08bdkM" }, "source": [ "When an operation involves multiple `Series` objects, `pandas` automatically aligns items by matching index labels." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "5AI1dciMbdkM", "outputId": "132b6c95-3423-457a-cd78-2ecec6c6a120", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')\n", "Index(['alice', 'bob', 'colin', 'darwin'], dtype='object')\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "alice 136.0\n", "bob 166.0\n", "charles NaN\n", "colin NaN\n", "darwin 136.0\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 12 } ], "source": [ "print(s2.keys())\n", "print(s3.keys())\n", "\n", "s2 + s3" ] }, { "cell_type": "markdown", "metadata": { "id": "6JMZev9LbdkN" }, "source": [ "The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value. (ie. 
Not-a-Number means *missing*).\n", "\n", "Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items" ] }, { "cell_type": "markdown", "metadata": { "id": "H57kzHKabdkN" }, "source": [ "#### Init with a scalar\n", "You can also initialize a `Series` object using a scalar and a list of index labels: all items will be set to the scalar." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "fz5q8tyGbdkN", "outputId": "a50c358f-3049-464a-bd4d-587fad2b067b", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "life 42\n", "universe 42\n", "everything 42\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 13 } ], "source": [ "meaning = pd.Series(42, [\"life\", \"universe\", \"everything\"])\n", "meaning" ] }, { "cell_type": "markdown", "metadata": { "id": "lQUC1nqObdkO" }, "source": [ "Pandas makes it easy to plot `Series` data using matplotlib (for more details on matplotlib. Just import matplotlib and call the `plot()` method:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true, "id": "IQFkp_wlbdkO", "outputId": "e4cc9a2f-5028-43ac-ead3-e9d8e031bb7f", "colab": { "base_uri": "https://localhost:8080/", "height": 265 } }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3dd3jV5f3/8ec7e0IgCTvMMGSPyBCk4h4o1l131S+14modX7W7tdVWq9b9U+rexIG1ihMHDjRhb8IOM2GEJJB9//7g6JdSRkhOcuecvB7XlStnfMx5HeF65cN97s99m3MOEREJfRG+A4iISHCo0EVEwoQKXUQkTKjQRUTChApdRCRMRPl64bS0NNe1a1dfLy8iEpJyc3MLnXPp+3vOW6F37dqVnJwcXy8vIhKSzGzNgZ7TkIuISJhQoYuIhAkVuohImFChi4iECRW6iEiYUKGLiIQJFbqISJjwNg9d/Cotr+LbVdtYtrmYgZ1SGNolhdioSN+xRKQeVOjNRGV1DXPX7WBGXiFf5W1l1trtVNX831r4cdERDO+WypjMVI7qkUbf9i2IiDCPiUXkcKnQw5RzjmWbS5iRV8iXeYXMXLmV0opqzGBAx5b8z9jujMlMo3e7ZOas3fHDcX95dwkArRKiOSozjTGZaYzukUbn1ATP70hEDkWFHkbW79jNl4Fi/jJvK4Ul5QB0S0vkx0M7MiYzjZHdU0lJiPmP/+74vm05vm9bADbvLOOrFYXMWL6VL/MK+fe8jQBktI5nTGYaR/VI46geqaQmxTbumxORQzJfW9BlZWU5reVSP0W7Kvl6ZWHg7HorqwpLAUhLimF0ZtoPXx1T4uv0851zrCgoDRR8IV+v3EpxWRUAfdu3YEzPPeU+vFtrEmJ0biDSGMws1zmXtd/nVOiho6yymtw1238YHpm/vgjnIDEmkhHdUxkdGCLp1TYJs+CPf1dV1zB/fRFfrdjKjOWF5K7ZTkV1DdGRxtDOrX74BTKoU0uiIjWBSqQhqNBDVHWNY+GGoh8KPGf1dsqraoiKMIZ0TvmhwAdlpBDtoUB3V1STs2bbD/kWbtiJc5AcGxX4BZPKmMw0Mts0zC8YkeboYIWufyc3MRuLdvPR4i18GRjiKNpdCUCfdslcPLILYzLTOLJba5Ji/f/RxcdEcnTPdI7uuWdp5u2lFXy9cusPBf/R4s0AtEmO3TP+nplGn3bJRDRSuUdGGD3bJGm2jjQbOkNvQop2V3L0Xz9hZ1kVHVPiGZ25ZxjlqB5ppCeH3oeQ67bt2jP+nreVr/IK2Vpa0egZjumdzhOXZBETpSEgCQ86Qw8R78zbwM6yKl64cgSjM1NDfpgio3UC57fuzPlHdqamxrFkUzFrt+1qtNdfvrmYv3+4jBtfnc2DFwzRuL6EPRV6EzIlJ5/ebZPDosz3FRFh9O3Qgr4dWjTaa57cvx3xMZHc+e/FJMTM529nD9Twi4Q1FXoTkbelmDnrdvCrU48IuzL36aqju1NSXsUDHy0nKTaK353eV/9/JWyp0JuIKbn5REYYZw7p6DtK2LnhuJ4Ul1XxzxmrSI6L4qYTe/uOJNIgajWoaGYpZpZtZkvMbLGZjdrneTOzB80sz8zmmdnQhokbnqqqa3hj1nrG9W4Tkh9+NnVmxq9PO4ILjszgoU/yePyzFb4jiTSI2p6h/wOY5pw7x8xigH0X9jgF6Bn4GgE8FvgutfD58gIKiss5N6uT7yhhy8z4848HUFJexd3vLSEpNoqLR3bxHUskqA5Z6GbWEhgLXA7gnKsA9p1/NgF4zu2ZA/lN4Iy+vXNuY5DzhqXs3HxaJ8Ywrncb31HCWmSEcf/5g9ldUc1vpi4gKTZKQ1wSVmoz5NINKACeNrPZZjbZzBL3OaYjsG6v+/mBx/6DmU00sxwzyykoKKhz6HCyvbSCjxZt4czBHTVXuhFER0bwyEVDGdktlZumzOWDh
Zt8RxIJmto0SBQwFHjMOTcEKAVuq8uLOeeecM5lOeey0tPT6/Ijws7UOeupqK7RcEsjiouO5MnLshjQsSXXvjSbGcsLfUcSCYraFHo+kO+cmxm4n82egt/beiBjr/udAo/JIUzJzad/xxYc0b7x5mcLJMVG8cxPj6R7eiL/81wOuWu2+Y4kUm+HLHTn3CZgnZl9P9frOGDRPoe9DVwamO0yEijS+PmhLdqwk4UbdnLOUJ2d+5CSEMPzV46gXcs4Ln/6OxasL/IdSaReajtoex3wopnNAwYDfzGzq83s6sDz7wIrgTzgSeCaoCcNQ9m5+cRERjBhsD6Y8yU9OZYXrhpBcmwUlz71LXlbin1HEqkzLc7lSUVVDSPv+piR3Vvz6EXDfMdp9lYVlnLu418TFWFMuXoUGa215Z40TQdbnEvTKjz5ZMkWtpVWcO6wjEMfLA2uW1oiz185nN2V1Vw0eSabd5b5jiRy2FTonmTn5tMmOZaje6b5jiIBR7RvwTM/PZKtJeVcPHkm2zws9ytSHyp0DwqKy5m+dAs/HtpRS7o2MUM6t2LyZUeydtsuLnvqW3aWVfqOJFJrahMP3pq9nuoap+GWJmpUj1Qeu3goizfu5KpncthdUe07kkitqNAbmXOOKbnrGNI5hcw2Sb7jyAEc26ctD1wwmJw12/jZC7mUV6nUpelToTeyeflFLNtcwjnDNPe8qRs/sAN3nzWQz5cVcOMrc6iqrvEdSeSgVOiNLDs3n9ioCE4f1MF3FKmF847M4Dfj+/Legk387+vzqanxM81XpDa0wUUjKqusZuqc9Zzcvx0t4qJ9x5FaunJMN0rKqrj/o2UkxUby+zP6adcjaZJU6I3ow0Wb2VlWpQ9DQ9D1x2VSUl7Jk1+sIikuiltO6uM7ksh/UaE3oim5+XRoGceoHqm+o8hhMjPuOPUISsqreGT6ChJjo7jmmEzfsUT+gwq9kWwqKmPG8gImjcskUjvPhyQz484zB1BaXs3fpi0lOTaKS0Z19R1L5Acq9Eby+qx8ahya3RLiIiOMv583iF0VVfxm6kISYqI4W3+m0kRolksjcM6RnZvP8G6t6ZK672ZPEmqiIyN4+MKhHNUjlVuy5zJtgXY9kqZBhd4IctdsZ1Vhqc7Ow0hcdCRPXprFoIwUrn95Np8v05aK4p8KvRFk5+aTEBPJaQPa+44iQZQYG8Uzlw+nR5skJj6fw3erteuR+KVCb2C7Kqp4Z95GTh3QnsRYfWQRblomRPPcFcPp0DKeK57+jrwtJb4jSTOmQm9g0xZsoqS8inM13BK20pNjef6qEcRERTDpxVlazEu8UaE3sCk5+XRuncDwbq19R5EG1DElnvvPH8yyLcX8duoC33GkmVKhN6B123bx9cqtnDOsky4VbwbG9krnunGZTMnNZ0rOOt9xpBlSoTeg12flY4bmKTcjNxzfi1HdU/nN1AUs3aQNp6VxqdAbSE3Nnrnno3uk0TEl3nccaSSREcY/fjKYpNhofv5iLqXlVb4jSTOiQm8g36zaSv723Zp73gy1SY7jwZ8MZnVhKXe8OR/ntOSuNA4VegPJzs0nOTaKk/q18x1FPDiqRxq/OL4XU+ds4OVvNZ4ujUOF3gBKyqt4b/4mxg/qQHxMpO844smkcZmM7ZXO7/+1kAXri3zHkWZAhd4A/j1vA7srqzk3S8MtzVlEhHH/eYNonRDDtS/NYmdZpe9IEuZU6A1gSk4+3dMTGZKR4juKeJaaFMtDFw5h3fbd3Pb6PI2nS4NSoQfZqsJSctZs59xhGZp7LgAc2bU1t5zUm3fnb+K5r9f4jiNhrFaFbmarzWy+mc0xs5z9PH+MmRUFnp9jZr8NftTQkJ27jgiDs4Z29B1FmpCJR3fnuD5tuPPfi5i7bofvOBKmDucMfZxzbrBzLusAz38ReH6wc+6PwQgXaqprHK/nrudHvdJp2yLOdxxpQiICG2O0SY5j0kuzKNql8XQJPg25BNGMvEI27SzjH
G0CLfuRkhDDwxcOYfPOMm7OnqvxdAm62ha6Az4ws1wzm3iAY0aZ2Vwze8/M+u3vADObaGY5ZpZTUBB+GwJk5+aTkhDN8X3b+I4iTdSQzq247ZQj+HDRZv45Y5XvOBJmalvoY5xzQ4FTgElmNnaf52cBXZxzg4CHgLf290Occ08457Kcc1np6el1Dt0UFe2q5P2Fm5gwqAOxUZp7Lgd2xeiunNSvLXe/t4TcNdt9x5EwUqtCd86tD3zfArwJDN/n+Z3OuZLA7XeBaDNLC3LWJu3teRuoqKrRcIsckpnxt3MG0T4ljmtfmsW20grfkSRMHLLQzSzRzJK/vw2cCCzY55h2FpijZ2bDAz93a/DjNl3ZOevo0y6Z/h1b+I4iIaBlfDSPXjiMrSUV/PK1OdTUaDxd6q82Z+htgRlmNhf4Fvi3c26amV1tZlcHjjkHWBA45kHgAteMPvFZtrmYuflFWvdcDsuATi35zfgj+HRpAY9/vsJ3HAkDh9zk0jm3Ehi0n8cf3+v2w8DDwY0WOrJz84mKMM4cornncnguHtmFmau2ce/7SxnWuRUjuqf6jiQhTNMW66myuoY3Zq1nXJ82pCXF+o4jIcbMuOusAXRJTeS6l2dTWFLuO5KEMBV6PX22tIDCknJtAi11lhwXzSMXDqVodyU3vjKHao2nSx2p0OspOzeftKQYxvXR3HOpu74dWvCHM/oxI6+Qhz/J8x1HQpQKvR62lVbw8ZLNnDm4I9GR+l8p9XP+kRmcNaQjD3y8jC/zCn3HkRCkFqqHt2avp7LacY7WPZcgMDPu/HF/eqQnccMrs9mys8x3JAkxKvR6mJKbz4COLenTTnPPJTgSYqJ47KKhlJZXc93Ls6mqrvEdSUKICr2OFm4oYvHGndqVSIKuZ9tk7jyzPzNXbeP+j5b5jiMhRIVeR1Ny8omJjOCMQR18R5EwdPawTpyflcEj01cwfekW33EkRKjQ66Ciqoapc9ZzQt+2pCTE+I4jYeoPE/rRp10yv3x1Dht27PYdR0KACr0OPl68me27KvVhqDSouOhIHr1oKBVVNVz38mwqNZ4uh6BCr4Ps3HzatohlbM/wWgJYmp7u6UncffZActds5573l/qOI02cCv0wbSku49NlBZw1tBOREVqISxre6YM6cPHIzjzx+Uo+XLTZdxxpwlToh+nNWeuprnGco0v9pRH9+rS+9O/Ygptem8O6bbt8x5EmSoV+GJxzTMnNZ2jnFHqkJ/mOI81IXHQkj1w4FOfg2pdmUVGl8XT5byr0wzA3v4i8LSWcm6VdiaTxdUlN5J5zBzI3v4i/vLvYdxxpglToh2FKzjrioiM4bWB731GkmTq5f3t+Ororz3y1mvfmb/QdR5oYFXotlVVW8/bcDZzcrx0t4qJ9x5Fm7PZTjmBwRgq3Zs9jdWGp7zjShKjQa+n9hZsoLqvScIt4FxMVwcMXDiEiwpj00izKKqt9R5ImQoVeS9m5+XRMiWeUtgiTJqBTqwTuO28QCzfs5E/vLPIdR5oIFXotbNixmxl5hZw9rBMRmnsuTcRxR7TlZ2O78+LMtUyds953HGkCVOi18MasfJyDc4Zq7rk0LTef1JusLq244435rCgo8R1HPFOhH4JzjuzcfEZ0a03n1ATfcUT+Q3RkBA9dOITY6EgmvTiL3RUaT2/OVOiH8N3q7azeuksfhkqT1b5lPPedN4ilm4v53dsLfMcRj1Toh/Dyt2tJjInk1AHtfEcROaBjerdh0jGZvJaTT3Zuvu844okK/SCWbipm6pz1/GR4ZxJionzHETmoG4/vycjurfn1W/NZtrnYdxzxQIV+EH+btoTE2Cgmjcv0HUXkkKIiI3jwgiEkxUZzzYuzKC2v8h1JGpkK/QBmrtzKx0u2cM0xmbRK1K5EEhratIjjwQsGs6KghF+/tQDnnO9I0ohU6PvhnOPuaUto1yKOn47u6juOyGE5KjONG4/rxZuz1/Pqd+t8x5FGVKtCN7PVZjbfz
OaYWc5+njcze9DM8sxsnpkNDX7UxvP+wk3MXruDX5zQk7joSN9xRA7btcdmcnTPNH779kIWbdjpO440ksM5Qx/nnBvsnMvaz3OnAD0DXxOBx4IRzoeq6hr+Nm0pmW2SOFsXEkmIioww7j9/MK0Sopn00iyKyyp9R5JGEKwhlwnAc26Pb4AUMwvJNWZfy8lnZWEpt57Um6hIjUhJ6EpLiuXBC4awdtsubntjvsbTm4HaNpYDPjCzXDObuJ/nOwJ7D9blBx77D2Y20cxyzCynoKDg8NM2sF0VVTzw0TKyurTihL5tfccRqbcR3VO56cRe/HveRl74Zo3vONLAalvoY5xzQ9kztDLJzMbW5cWcc08457Kcc1np6el1+REN6ukvV7OluJzbTumDmRbhkvBw9dgejOudzp/eWcz8/CLfcaQB1arQnXPrA9+3AG8Cw/c5ZD2w97XxnQKPhYxtpRU8/ukKTujblqyurX3HEQmaiAjjvvMGk5YUwzUv5VK0W+Pp4eqQhW5miWaW/P1t4ERg3wUj3gYuDcx2GQkUOedCan+shz/Jo7SiiltP6u07ikjQtUqM4aELh7JxRxm3Zs/VeHqYqs0ZeltghpnNBb4F/u2cm2ZmV5vZ1YFj3gVWAnnAk8A1DZK2gazbtovnv1nNucMy6Nk22XcckQYxrEsrbjulD+8v3MxTX672HUcawCEXKHHOrQQG7efxx/e67YBJwY3WeO77cBkRZvzihF6+o4g0qCvHdGPmqm3c9e5ihnROYWjnVr4jSRA1+3l5CzcU8dac9VwxphvtWsb5jiPSoMyMe88ZRLuWcVz30mx27KrwHUmCqNkX+l+nLaVFXDRX/6iH7ygijaJlQjSPXjSUguJybnptLjU1Gk8PF8260L/MK+TzZQVcOy6TlvHRvuOINJqBnVL41WlH8PGSLTzxxUrfcSRImm2h19Q47n5vCR1T4rlkVBffcUQa3aWjunDagPbc8/5Svlu9zXccCYJmW+jvLtjI/PVF/PKEXlqAS5olM+PusweQ0Sqea1+axdaSct+RpJ6aZaFXVtdwz/tL6dMumTOH/NcKBSLNRnJcNI9cNJTtuyq58dU5Gk8Pcc2y0F/+di1rtu7if0/uQ2SELvGX5q1fh5b8/vR+fLG8kEem5/mOI/XQ7Aq9pLyKBz9ezohurTmmd9NbT0bEh58Mz+DMwR24/6NlfLWi0HccqaNmV+iTv1hJYUmFFuAS2YuZ8ecfD6BbWiLXvzyHLcVlviNJHTSrQi8oLufJz1dy6oB2DNEVciL/ITE2ikcvGkZJeSU3vDyHao2nh5xmVegPf7Kcsqoabj5RC3CJ7E/vdsn8aUJ/vl65lX98tMx3HDlMzabQVxeW8uLMtVxwZAbd05N8xxFpss7NyuDcYZ14aHoeny9rehvRyIE1m0K/94OlREdGcMNxPX1HEWny/jihP73aJHPjq3PYVKTx9FDRLAp9Xv4O3pm3kauO7kabFlqAS+RQ4mMieeSioZRVVnPdy7Ooqq7xHUlqIewL3bk9l/i3Toxh4tjuvuOIhIzMNkncddYAvlu9nXs/0Hh6KAj7Qv98eSFfrdjKdcdmkhynBbhEDseEwR25cERnHv9sBdOXbvEdRw4hrAv9+wW4MlrHc+GIzr7jiISk347vS++2ydzxxnxKyqt8x5GDCOtCf3vuBhZv3MnNJ/YmNkoLcInURVx0JHedPYBNO8v4+wdLfceRgwjbQi+vqubeD5bSr0MLTh/YwXcckZA2tHMrLhnZhWe+Ws3cdTt8x5EDCNtCf/GbteRv381tp/QhQgtwidTbLSf1pk1yLLe/MV+zXpqosCz0nWWVPPTJcsZkpnF0Ty3AJRIMyXHR/OGMfizauJOnvlzlO47sR1gW+hOfrWT7rkr+9+Q+vqOIhJWT+rXj+CPacv+Hy1m3bZfvOLKPsCv0LTvLmDxjJacP6sCATi19xxEJK2bGHyf0I8Lg128twDkt4NWUhF2hP/DxcqprH
Def2Mt3FJGw1CElnptP6s1nywr417yNvuPIXsKq0FcUlPDqd+u4aEQXuqQm+o4jErYuHdWVQZ1a8sd/LaRoV6XvOBIQVoV+z7SlxEVFcO2xmb6jiIS1yAjjL2cNYPuuSu56b7HvOBIQNoU+a+12pi3cxMSxPUhLivUdRyTs9evQkivHdOOV79bx7aptvuMIYVLozjnufncJaUkxXHV0N99xRJqNG4/vSadW8dz+xjzKq6p9x2n2al3oZhZpZrPN7J39PHe5mRWY2ZzA11XBjXlw05du4dvV27jhuJ4kxkY15kuLNGsJMVHceWZ/VhSU8vinK33HafYO5wz9BuBgg2WvOucGB74m1zNXrVXXOP763lK6piZwwXAtwCXS2I7p3YbTB3Xgkel55G0p8R2nWatVoZtZJ+A0oNGKurbemJXP0s3F3HJSH6Ijw2IESSTk/HZ8X+KiI/jVm/M1N92j2jbgA8CtwMEWcDjbzOaZWbaZZezvADObaGY5ZpZTUFD/vQrLKqu578NlDOrUklMHtKv3zxORuklPjuWOU49g5qptTMnJ9x2n2TpkoZvZeGCLcy73IIf9C+jqnBsIfAg8u7+DnHNPOOeynHNZ6en1X2Plua9Xs7GojP89pQ9mWoBLxKfzsjIY3rU1f353MYUl5b7jNEu1OUMfDZxhZquBV4BjzeyFvQ9wzm11zn3/JzgZGBbUlPtRtKuSR6av4Ee90jmqR1pDv5yIHEJEhPGXs/qzq6KKP72zyHecZumQhe6cu90518k51xW4APjEOXfx3seYWfu97p7BwT88DYpHP8tjZ5kW4BJpSjLbJPPzYzKZOmcDny2r/7CqHJ46f4poZn80szMCd683s4VmNhe4Hrg8GOEOZMOO3Tz95Wp+PLgjfTu0aMiXEpHDdM0xPeiensiv35rP7grNTW9Mh1XozrlPnXPjA7d/65x7O3D7dudcP+fcIOfcOOfckoYI+70HPloGDn5xghbgEmlq4qIj+cuPB7Bu224e+HiZ7zjNSsjN81u2uZjs3HwuGdWFjNYJvuOIyH6M7J7KeVmdmPzFKhZt2Ok7TrMRcoW+ZWc5mW2SmDROC3CJNGV3nHoEKfHR3P7mfKprNDe9MYRcoY/pmcb7N46ldWKM7ygichApCTH89vS+zF23g+e/Xu07TrMQcoUOaM65SIg4Y1AHju6Zxj3vL2Vj0W7fccJeSBa6iIQGM+PPZw6g2jl+N3Wh7zhhT4UuIg2qc2oCNxzXiw8WbWbagk2+44Q1FbqINLirju5Gn3bJ/P7thRSXacu6hqJCF5EGFx0Zwd1nD2RzcRn3vr/Ud5ywpUIXkUYxOCOFS0d24blv1jB77XbfccKSCl1EGs3NJ/WmbXIct78xn8rqg63GLXWhQheRRpMcF80fJvRjyaZiJn+xynecsKNCF5FGdVK/dpzYty3/+HgZa7fu8h0nrKjQRaTR/WFCP6IiIvjVW9qyLphU6CLS6Nq3jOfmE3vxxfJCps7Z4DtO2FChi4gXl4zqyqCMFP70ziJ27KrwHScsqNBFxIvICOPuswawY3clf3m3wTc5axZU6CLizRHtW3DV0d14LSefr1ds9R0n5KnQRcSrG4/rRUbreH715nzKKrVlXX2o0EXEq/iYSO48cwArC0t59NMVvuOENBW6iHj3o17pTBjcgcc+zSNvS7HvOCFLhS4iTcJvxvclISaK29+YT422rKsTFbqINAlpSbHccWofvlu9nVdz1vmOE5JU6CLSZJyXlcGIbq25693FfLhos64iPUwqdBFpMsyMv549kNSkWP7nuRzOfuwrvlmp6Yy1pUIXkSala1oiH/xiLHedNYD1O3ZzwRPfcOlT37JgfZHvaE2e+fonTVZWlsvJyfHy2iISGsoqq3n2q9U8+ukKinZXMn5ge246sTfd0hJ9R/PGzHKdc1n7fU6FLiJNXdHuSp78fCX/nLGKiuoazsvqxPXH9aR9y3jf0RrdwQq91kMuZhZpZrPN7
J39PBdrZq+aWZ6ZzTSzrnWPKyLyn1rGR3PzSb35/NZxXDKyC9m5+Rxzz6fc9e5itpdqYa/vHc4Y+g3AgVbQuRLY7pzLBO4H/lrfYCIi+0pPjuX3Z/Tjk5uO4bQB7Xnii5WM/dt0Hvp4OaXlVb7jeVerQjezTsBpwOQDHDIBeDZwOxs4zsys/vFERP5bRusE7jt/MNNuGMvIHqn8/cNl/Oie6Tzz5SrKq5rvejC1PUN/ALgVONCurh2BdQDOuSqgCEjd9yAzm2hmOWaWU1BQUIe4IiL/p3e7ZJ68NIvXf34UPdKT+P2/FnHc3z/j9dx8qpvh1aaHLHQzGw9scc7l1vfFnHNPOOeynHNZ6enp9f1xIiIADOvSilcmjuTZK4bTMj6am6bM5ZR/fM4HCzc1q4uTanOGPho4w8xWA68Ax5rZC/scsx7IADCzKKAloKsBRKTRmBk/6pXOv64dw8MXDqGy2jHx+VzOeuyrZrPW+iEL3Tl3u3Ouk3OuK3AB8Ilz7uJ9DnsbuCxw+5zAMc3n16KINBkREcb4gR1+uDhp444yfvJk87g4qc5XiprZH83sjMDdfwKpZpYH/BK4LRjhRETqKjoygp8M78yntxzDHaf2YV7+DsY/NINJL81iZUGJ73gNQhcWiUizsLPs/y5OKq8K3YuTdKWoiEhAQXE5j0zP48WZazAzLj+qKz//UQ9aJcb4jlYrQblSVEQkHOx9cdL4ge158ouVnPKPL1i3bZfvaPWmQheRZimjdQL3nTeYqZNGs7uymosmz2TzzjLfsepFhS4izdrATik8e8VwtpaUc/HkmWwL4bVhVOgi0uwNzkhh8mVHsnbbLi576lt2llX6jlQnKnQREWBUj1Qeu3goizfu5MpnvmN3ReitCaNCFxEJOLZPWx64YDC5a7Yz8fmckFvoS4UuIrKX8QM7cPdZA/lieSE3vDyHquoDrUnY9KjQRUT2cd6RGfxmfF+mLdzEra/PoyZEVm6M8h1ARKQpunJMN0rLq7jvw2UkxUbxhzP60dS3eVChi4gcwHXHZlJSXsUTn68kOS6KW07q4zvSQanQRUQOwMy4/ZQ+FJdV8cj0FSTGRnHNMZm+Yx2QCl1E5CDMjDvP7M+uiir+Nm0pybFRXB6yN4UAAAaDSURBVDKqq+9Y+6VCFxE5hMgI495zB1FaXs1vpi4kISaKs4d18h3rv2iWi4hILURHRvDwhUM4qkcqt2TPZdqCjb4j/RcVuohILcVFR/LkpVkMykjhupdn89myprXZvQpdROQwJMZG8czlw8lsk8zPns/hu9XbfEf6gQpdROQwtUyI5vkrh9MhJZ4rnv6O+flNY69SFbqISB2kJcXywpUjaBEfzaVPzWT55mLfkVToIiJ11SElnhevGkFUZAQXTZ7J2q1+dz1SoYuI1EPXtEReuHIEFdU1XDj5GzYV+dv1SIUuIlJPvdsl8+xPh7NjVyUXTf6GrSXlXnKo0EVEgmBQRgqTL8sif/tuLvW065EKXUQkSEZ2T+XxS4axbHMxVzz9Hbsqqhr19VXoIiJBNK53Gx44fwiz1m7nZ8/nNuquRyp0EZEgO21ge+4+e8+uR9e9NLvRdj1SoYuINIDzsjL43el9+WDRZm7Jbpxdj7TaoohIA/np6D27Ht37wTISYyP504T+Dbrr0SEL3czigM+B2MDx2c653+1zzOXAPcD6wEMPO+cmBzeqiEjomTQuk+LyKv7fZytJjI3itpP7NFip1+YMvRw41jlXYmbRwAwze885980+x73qnLs2+BFFREKXmXHbyX0oKdtT6i3iopk0rmF2PTpkoTvnHFASuBsd+AqNLbBFRJoAM+NPE/pTWl7FPe8vJTEmkstHdwv669TqQ1EzizSzOcAW4EPn3Mz9HHa2mc0zs2wzyzjAz5loZjlmllNQ0LTWERYRaUgREcY95w7i9EEd6Jya0CCvYXtOwGt5sFkK8CZwnXNuwV6PpwIlzrlyM/sZcL5z7tiD/aysr
CyXk5NTx9giIs2TmeU657L299xhTVt0zu0ApgMn7/P4Vufc94sXTAaG1SWoiIjU3SEL3czSA2fmmFk8cAKwZJ9j2u919wxgcTBDiojIodVmlkt74Fkzi2TPL4DXnHPvmNkfgRzn3NvA9WZ2BlAFbAMub6jAIiKyf4c1hh5MGkMXETl8QRtDFxGRpkuFLiISJlToIiJhQoUuIhImvH0oamYFwJo6/udpQGEQ4zQ14fz+9N5CVzi/v1B6b12cc+n7e8JbodeHmeUc6FPecBDO70/vLXSF8/sLl/emIRcRkTChQhcRCROhWuhP+A7QwML5/em9ha5wfn9h8d5CcgxdRET+W6ieoYuIyD5U6CIiYSLkCt3MTjazpWaWZ2a3+c4TLGaWYWbTzWyRmS00sxt8Zwq2wM5Xs83sHd9Zgs3MUgK7dS0xs8VmNsp3pmAxs18E/k4uMLOXAxvHhywze8rMtpjZ3pv0tDazD81seeB7K58Z6yqkCj2whO8jwClAX+AnZtbXb6qgqQJucs71BUYCk8LovX3vBsJ3rfx/ANOcc32AQYTJ+zSzjsD1QJZzrj8QCVzgN1W9PcM+m/QAtwEfO+d6Ah8H7oeckCp0YDiQ55xb6ZyrAF4BJnjOFBTOuY3OuVmB28XsKYSOflMFj5l1Ak5jz45WYcXMWgJjgX8COOcqArt7hYsoIN7MooAEYIPnPPXinPucPfs27G0C8Gzg9rPAmY0aKkhCrdA7Auv2up9PGJXe98ysKzAE2N9m3KHqAeBWoMZ3kAbQDSgAng4MKU02s0TfoYLBObceuBdYC2wEipxzH/hN1SDaOuc2Bm5vAtr6DFNXoVboYc/MkoDXgRudczt95wkGMxsPbHHO5frO0kCigKHAY865IUApIfpP9n0FxpInsOeXVgcg0cwu9puqYbk9c7lDcj53qBX6eiBjr/udAo+FBTOLZk+Zv+ice8N3niAaDZxhZqvZM0x2rJm94DdSUOUD+c657/9Flc2egg8HxwOrnHMFzrlK4A3gKM+ZGsLm7/dGDnzf4jlPnYRaoX8H9DSzbmYWw54PZ972nCkozMzYMwa72Dl3n+88weScu90518k515U9f2afOOfC5izPObcJWGdmvQMPHQcs8hgpmNYCI80sIfB39DjC5APffbwNXBa4fRkw1WOWOqvNJtFNhnOuysyuBd5nz6ftTznnFnqOFSyjgUuA+WY2J/DYHc65dz1mktq7DngxcKKxEvip5zxB4ZybaWbZwCz2zMSaTYhfJm9mLwPHAGlmlg/8DrgbeM3MrmTPst7n+UtYd7r0X0QkTITakIuIiByACl1EJEyo0EVEwoQKXUQkTKjQRUTChApdRCRMqNBFRMLE/wc1wYwtu/phRAAAAABJRU5ErkJggg==\n" }, "metadata": { "needs_background": "light" } } ], "source": [ "temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]\n", "s4 = pd.Series(temperatures, name=\"Temperature\")\n", "s4.plot()\n", "plt.show()" ] }, { "cell_type": "markdown", "source": [ "You can easily convert it to Numpy array by dicarding the index. 
" ], "metadata": { "id": "jiCCsIfv4LoD" } }, { "cell_type": "code", "source": [ "s4.to_numpy()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8FykEnJJ4HQl", "outputId": "65fbfab9-2483-4fad-a78b-e7fea6a22bf6" }, "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([4.4, 5.1, 6.1, 6.2, 6.1, 6.1, 5.7, 5.2, 4.7, 4.1, 3.9, 3.5])" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "metadata": { "id": "JLC98XwVbdkO" }, "source": [ "There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) section of pandas' documentation, and look at the example code." ] }, { "cell_type": "markdown", "metadata": { "id": "jwz5rnEsbdkO" }, "source": [ "### Handling time\n", "Many datasets have timestamps, and pandas is awesome at manipulating such data:\n", "* it can represent periods (such as 2016Q3) and frequencies (such as \"monthly\"),\n", "* it can convert periods to actual timestamps, and *vice versa*,\n", "* it can resample data and aggregate values any way you like,\n", "* it can handle timezones.\n", "\n", "#### Time range\n", "Let's start by creating a time series using `pd.date_range()`. This returns a `DatetimeIndex` containing one datetime per hour for 12 hours, starting on April 23rd, 2022 at 5:30pm." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "becHbUssbdkO", "outputId": "3e5fc357-5efe-4e83-cf5e-a03c598724d0", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "DatetimeIndex(['2022-04-23 17:30:00', '2022-04-23 18:30:00',\n", " '2022-04-23 19:30:00', '2022-04-23 20:30:00',\n", " '2022-04-23 21:30:00', '2022-04-23 22:30:00',\n", " '2022-04-23 23:30:00', '2022-04-24 00:30:00',\n", " '2022-04-24 01:30:00', '2022-04-24 02:30:00',\n", " '2022-04-24 03:30:00', '2022-04-24 04:30:00'],\n", " dtype='datetime64[ns]', freq='H')" ] }, "metadata": {}, "execution_count": 16 } ], "source": [ "dates = pd.date_range('2022/04/23 5:30pm', periods=12, freq='H')\n", "dates" ] }, { "cell_type": "markdown", "metadata": { "id": "Zhslg2XVbdkP" }, "source": [ "This `DatetimeIndex` may be used as an index in a `Series`:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "ojYFJeizbdkP", "outputId": "c6e38d74-043b-457e-b83e-a727554dd093", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2022-04-23 17:30:00 4.4\n", "2022-04-23 18:30:00 5.1\n", "2022-04-23 19:30:00 6.1\n", "2022-04-23 20:30:00 6.2\n", "2022-04-23 21:30:00 6.1\n", "2022-04-23 22:30:00 6.1\n", "2022-04-23 23:30:00 5.7\n", "2022-04-24 00:30:00 5.2\n", "2022-04-24 01:30:00 4.7\n", "2022-04-24 02:30:00 4.1\n", "2022-04-24 03:30:00 3.9\n", "2022-04-24 04:30:00 3.5\n", "Freq: H, dtype: float64" ] }, "metadata": {}, "execution_count": 17 } ], "source": [ "temp_series = pd.Series(temperatures, dates)\n", "temp_series" ] }, { "cell_type": "markdown", "metadata": { "id": "O_vVwZqWbdkP" }, "source": [ "Let's plot this series:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "gOa_KJ_JbdkP", "outputId": "f0da59fd-db16-43d8-8896-84fd8d7c1b2c", "colab": { "base_uri": "https://localhost:8080/", "height": 360 } }, 
"outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAFXCAYAAACLPASQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAcG0lEQVR4nO3de5CldX3n8feXixEYHEVIYwZxCN7Wdbww7SVFspnBqCheUrWuxl01rKbGrKuSWncFq/ZiKjdjFUZ3K5oQ76trx3hlxesqA2tWwB5ER0UEEZQpBS9cHCUi8t0/nqelaU9Pn26e53d+5znvV9Wp7vM8p8/n+f1O97fP+T3P83siM5Ek1eugSW+AJOnALNSSVDkLtSRVzkItSZWzUEtS5Q7p40mPPvro3Lp167p/7sc//jFHHHFE9xs04SzzzDNvdvI2mrVnz57vZ+YxI1dmZue37du350acf/75G/q52rPMM8+82cnbaBawmKvUVIc+JKlyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKmehlqTKWaglqXK9nEKuMraedd6q616x7XZOX2X9Na85ra9NktQD31FLUuUs1JJUubGGPiLi3sCbgYcDCbwwMz/X54ZNo6EPRZRu39D7UxrXuGPUbwA+npnPioh7AIf3uE2SpGXWLNQRsRn4F8DpAJl5G3Bbv5slSVoSzTSoB3hAxKOAc4CvAo8E9gBnZOaPVzxuF7ALYG5ubvvCwsK6N2b//v1s2rRp3T+3EX1k7d1386rr5g6D628dvW7bls3mVZB3ICV/N82b7ryNZu3cuXNPZs6PWjdOoZ4HLgJOzsyLI+INwC2Z+V9W+5n5+flcXFxc94bu3r2bHTt2rPvnNqKPrLXGVM/eO/oDTF9juOZ1p+TvpnnTnbfRrIhYtVCPc9THdcB1mXlxe/99wEnr3gpJ0oasWagz87vAtyPiIe2iJ9AMg0iSChj3qI+XAe9uj/i4Gvi3/W2SJGm5sQp1Zl4GjBw7kST1yzMTJalyFmpJqpyFWpIqZ6GWpMo5H7XUchIo1cp31JJUOQu1JFXOQi1JlbNQS1LlLNSSVDkLtSRVzkItSZWzUEtS5SzUklQ5C7UkVc5CLUmVs1BLUuUs1JJUOQu1JFXOQi1JlbNQS1LlvHCANCFeqEDj8h21JFXOQi1JlRv00IcfLSUNwViFOiKuAX4E/By4PTPn+9woSdKd1vOOemdmfr+3LZEkjeQYtSRVLjJz7QdFfBO4EUjgbzPznBGP2QXsApibm9u+sLCw7o3Zv38/mzZtWvfPrWbvvptXXTd3GFx/6+h127ZsNs+8weUdSNd/e7Oct9GsnTt37lltWHncQr0lM/dFxK8CnwJelpkXrvb4+fn5XFxcXPeG7t69mx07dqz751az1s7Es/eOHvnZ6M5E88yrOe9Auv7bm+W8jWZFxKqFeqyhj8zc1369Afgg8Nh1b4UkaUPWLNQRcUREHLn0PfAk4Mt9b5gkqTHOUR9zwAcjYunx/yszP97rVkmSfmHNQp2ZVwOPLLAtkqQRPDxPkipnoZakylmoJalyFmpJqtygZ8+TdCdnk5xevqOWpMpZqCWpchZqSaqchVqSKmehlqTKWaglqXIWakmqnIVakipnoZakyhU/M9GzoyRpfXxHLUmVs1BLUuUs1JJUOQu1JFXOQi1JlbNQS1LlLNSSVDkLtSRVzkItSZUb+8zEiDgYWAT2ZebT+tskSUPgWcjdWc876jOAy/vaEEnSaGMV6og4DjgNeHO/myNJWikyc+0HRbwP+AvgSOA/jhr6iIhdwC6Aubm57QsLCyOfa+++m1fNmTsMrr919LptWzavuZ2TzDLPPPMmm3cg+/fvZ9OmTZ0/b5dZO3fu3JOZ86PWrVmoI+JpwFMz8yURsYNVCvVy8/Pzubi4OHLdWuNWZ+8dPWy+kXGrklnmmWfeZPMOZPfu3
ezYsaPz5+0yKyJWLdTjDH2cDDwjIq4BFoBTIuJd694KSdKGrHnUR2a+CngVwLJ31M/rebskaV02cpTJtBxh4nHUklS5dV3hJTN3A7t72RJJ0ki+o5akylmoJalyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKmehlqTKWaglqXIWakmq3Lrm+pAklb8epO+oJalyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKmehlqTKWaglqXIWakmqnIVakipnoZakyq1ZqCPinhFxSUR8MSK+EhF/XGLDJEmNcWbP+ylwSmbuj4hDgc9GxMcy86Ket02SxBiFOjMT2N/ePbS9ZZ8bJUm6UzR1eI0HRRwM7AEeCPx1Zp454jG7gF0Ac3Nz2xcWFkY+1959N6+aM3cYXH/r6HXbtmxeczsnmWWeeeZNX15Nbdu5c+eezJwftW6sQv2LB0fcG/gg8LLM/PJqj5ufn8/FxcWR69aacPvsvaPf5G9kwu2SWeaZZ9705dXUtohYtVCv66iPzLwJOB84dT0/J0nauHGO+jimfSdNRBwGPBH4Wt8bJklqjHPUx/2Ad7Tj1AcB783Mj/S7WZKkJeMc9fEl4NEFtkWSNIJnJkpS5SzUklQ5C7UkVc5CLUmVs1BLUuUs1JJUOQu1JFXOQi1JlbNQS1LlLNSSVDkLtSRVzkItSZWzUEtS5SzUklQ5C7UkVc5CLUmVs1BLUuUs1JJUOQu1JFXOQi1JlbNQS1LlLNSSVDkLtSRVzkItSZWzUEtS5dYs1BFx/4g4PyK+GhFfiYgzSmyYJKlxyBiPuR14RWZeGhFHAnsi4lOZ+dWet02SxBjvqDPzO5l5afv9j4DLgS19b5gkqRGZOf6DI7YCFwIPz8xbVqzbBewCmJub276wsDDyOfbuu3nV5587DK6/dfS6bVs2j72dk8gyzzzzpi+vprbt3LlzT2bOj1o3dqGOiE3ABcCfZeYHDvTY+fn5XFxcHLlu61nnrfpzr9h2O2fvHT0ac81rThtrOyeVZZ555k1fXk1ti4hVC/VYR31ExKHA+4F3r1WkJUndGueojwDeAlyema/rf5MkScuN8476ZOD5wCkRcVl7e2rP2yVJaq15eF5mfhaIAtsiSRrBMxMlqXIWakmqnIVakipnoZakylmoJalyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKmehlqTKWaglqXIWakmqnIVakipnoZakylmoJalyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKmehlqTKrVmoI+KtEXFDRHy5xAZJku5qnHfUbwdO7Xk7JEmrWLNQZ+aFwA8LbIskaYTIzLUfFLEV+EhmPvwAj9kF7AKYm5vbvrCwMPJxe/fdvGrO3GFw/a2j123bsnnN7ZxklnnmmTd9eTW1befOnXsyc37Uus4K9XLz8/O5uLg4ct3Ws85b9edese12zt57yMh117zmtHGiJ5ZlnnnmTV9eTW2LiFULtUd9SFLlLNSSVLlxDs97D/A54CERcV1EvKj/zZIkLRk9kLJMZj63xIZIkkZz6EOSKmehlqTKWaglqXIWakmqnIVakipnoZakylmoJalyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKmehlqTKWaglqXIWakmqnIVakipnoZakylmoJalyFmpJqpyFWpIqZ6GWpMpZqCWpchZqSaqchVqSKjdWoY6IUyPiioi4KiLO6nujJEl3WrNQR8TBwF8DTwEeBjw3Ih7W94ZJkhrjvKN+LHBVZl6dmbcBC8Az+90sSdKSyMwDPyDiWcCpmfkH7f3nA4/LzJeueNwuYFd79yHAFRvYnqOB72/g5zaiZJZ55pk3O3kbzXpAZh4zasUhd2977pSZ5wDn3J3niIjFzJzvaJOqyTLPPPNmJ6+PrHGGPvYB9192/7h2mSSpgHEK9eeBB0XECRFxD+D3gHP73SxJ0pI1hz4y8/aIeCnwCeBg4
K2Z+ZWetuduDZ1UnGWeeebNTl7nWWvuTJQkTZZnJkpS5SzUklQ5C7UkVa6z46jXKyI2A6cCW9pF+4BPZOZNPWQFzRmWy7MuyZ4G6Eu2rc0bevsG3Z9t5tzyvMy8vses0r8vQ+/P3rMmsjMxIl4A/Dfgk9x5TPZxwBOBP87Md3aY9STgjcCVK7IeCLwkMz/ZVVabV6xtbd7Q2zf0/nwU8DfA5hV5N7V5l3acV7p9g+3Poq9dZha/0Zxefu8Ry+8DfL3jrMuBrSOWnwBcPs1tm5H2Db0/L6OZkmHl8scDXxxA+wbbnyWzJjVGHcCot/J3tOu6dAhw3Yjl+4BDO86Csm2D4bdv6P15RGZevHJhZl4EHNFDXun2Dbk/i2VNaoz6z4BLI+KTwLfbZcfTfJz9k46z3gp8PiIWlmXdn+YMy7d0nAVl2wbDb9/Q+/NjEXEe8M4VeS8APt5DXun2Dbk/i2VN7ISXiLgP8GR+eQfRjT1kPQx4xoqsczPzq11ntXnF2tbmDb19Q+/Pp9BMHbwy76M95ZVu32D7s1TWRM9MLLlnts07CiAzf9hnTptVtG1t5mDbN/T+nITS7Rt6f/ZpUkd9LN9beh3NWGNfe2aPB14LnALc3GbdC/gMcFZmXtNVVptXrG1t3tDbN/T+3Ay8iuZd2RzNePwNwIeB12THhyBOoH2D7c+ir13Xe11r21sKfA54DnDwsmUH04yRXTTNbZuR9g29Pz8BnAkcu2zZscBZwCcH0L7B9mfRrK47aswGXnmAdVcVzFp13TS0zfYNoj+v2Mi6KWrfYPuzZNakjvoouWd2T0S8EXjHiqzfB77QcRaU34s/9PYNvT+vjYhXAu/Idty9HY8/fVl+l0q3b8j9WSxrkkd9lNlb2lzs4EUrsq4D/jfwlsz8aZd5bWbJvc6Dbl/pvNL92R7RchZ3Hee8nubiHH+ZHe94m0D7BtufRbMmVaglSeOZ+Ox57dXLV73fcdbTDnS/h7xibWuff+jtG3p/nnSg+z3klW7fYPuz76yJF2p++TTgPk4LXvKYNe53rWTbYPjtG3p//rs17netdPuG3J+9Zjn0IUmVm+R81E8Gfpe77iD6cGZ2vic/Ih7K6J1Rl3ed1eYVa1ubN/T2Db0/S8+3Xbp9g+3PUlkTGfqIiNcDZwAX0Jy19Nr2+5dHxBs6zjoTWKD5mHxJewvgPRFxVpdZbV6xtrV5Q2/f0PvzBcClwA7g8Pa2k+awthf0kFe6fYPtz6KvXdcHnI95oPjIeYRpXsBOD4IHvg4cOmL5PbrOKt22WWnfwPuz9HzbxX9fhtqfJbMmtTPxnyJi1I6ExwD/1HHWHcCvjVh+v3Zd10q2DYbfvqH3Z+n5tku3b8j9WSxrUmPUpwNviogjuXNS8fvTTNpyesdZfwR8OiKu5K7zGT8QeGnHWVC2bTD89pXOK92fpefbLt2+IfdnsaxJT3N6LHeduvK7PeUcxC9fXPPzmfnzPvLazCJta7MG3b7SeaX7M8rPt126fYPtz1JZkzyF/FiAzPxuRBwD/BbwtexhMvH2F4XMvKM9pfXhwDXZ07y4JdvW5g29fYPuzxH5R/WZNYHfl0H3Z5GsrgfzxxyEfzHwTeAamgPDL6a5LM8VwIs6zvpdmvPvv0NziNDFwKdpPkY/fZrbNiPtG3p/nkxzAdivAI8DPgV8g+aj9G8MoH2D7c+iWV131JgN3EtzKMt9gf2087nS7C29rOOsL9DMEXsCcAvwkHb5A4DFaW7bjLRv6P15CbAN+A3g+8BvtstPAv5xAO0bbH+WzJrUzsSfZeZPgJ9ExDeyHW/MzBsjovOxmKXnj4hvZeYV7bJrlz6Sdaxo29rnHnL7ht6fh2bm3jbve5n52Tbv0og4rIe80u0bcn8Wy5rU4XkZEUuXij9taWFE3JMetmnZL8QLly07mOZYzq4VbVv73ENu39D7c3kbXrViXR95pds35P4sl9X1R48xPzIcDxwyYvkW4Hc6znoMcM8Ry
7cCz5vmts1I+4ben88ADh+x/ETglQNo32D7s2SWkzJJUuUmPs1pRJxzoPsdZ736QPd7yCvWtvb5X32g+z3klW7f0Puz9Hzbrz7Q/QHklZzrvtesiRdq4G/XuN+lPWvc71rJtsHw2zf0/iw933bp9g25P3vNcuhDkio3qWlOPxARz4uITQWyfj0i3hoRfxoRmyLi7yLiyxHxDxGxtYe8gyLihRFxXkR8MSIujYiFiNjRdVabd0hEvDgiPh4RX2pvH4uIP1x2tEQRfQxFRMTBbfv+JCJOXrHuP/eQd3hEvDIi/lNE3DMiTo+IcyPitX39vkbEkyPiTW3Oue33p/aRtcZ2/NeenvfJEfGilX9vEfHC0T9xt7IiIp4dEf+q/f4JEfHfI+IlfR1+uCL/M7087yTeUUfEPuBzwCnA/wHeA5yXmbf1kHVh+/ybgecBbwPeCzwJ+DeZeUrHeW8DrqVp17NoDvL/v8CZNJPd/4+O894D3AS8gzsnLToO+H3gqMx8Tsd5R622CvhiZh7Xcd6baU54uQR4PnBBZv6Hdt2lmdnpteki4r00Z5YdBjyE5syzv6fZw39sZj6/47zXAw8G3sldX78X0EwDekaXeWtsy7cy8/iOn/PPgd+kmbf56cDrl/4Genr93gj8Ks3hcbcAv0JzVfDTgOu77M+I+NLKRTSv5dKx4o/oKqvTQ2PWcVjLF9qv96L54/so8D2aIvqkPrLa77+12roO87604v5F7ddfAS7vIW/VeW8PtO5u5P0cuJrmtO6l29L92/rsT5rZHs8BPtD2Zx+v32Xt1wC+y51vZmLla9vn60d/823fssrtR8DtPeTtpT28Erh3+7f+V+39Pl6/ve3XQ4EfAPdY9rvT6etH8w/gXcBDac603ErzT/4BwAO6zJrYCS8AmXlLZv7PzHxq29iLga6v+nBHRDw4mjmND4+IeYCIeCBwcMdZAD+LiBPbjJOA2wAy86eMnrv27vph+zHvF69lO/zyHKCP2deuBnZk5gnLbr+emSfQzOnQtV+cOJCZt2fmLuAy4DNAb0Nn2fwlfrT9unS/j9ev9HzbNwEPysx7rbgdSTMfR9cOyczbAbK5PNXTgXtFxD/QzwkvS1k/o5mhb+nv73Y6nv86M58BvJ/mzcMjM/MamjNpr83Ma7vM6vS/2Tr+E11YMOsJNB9FLqf5CPZ+4CrgBuCZPeSdAnwLuJLmXebj2uXHAK/tIW8rzUfz79FcTePrbdv+Hjihh7x/3/5Sjlr3sh7y3gWcOmL5H7R/FF3nvRnYNGL5icBne8g7ieYNyleBT7a3y4GLgO095P0p8NhV1v1lD3kfAX57le24o4e8j63y+h0LXNJ1XvvcRwCvAz4MXNdHxkwe9RERRwM3Zn/z4QZw38z8fh/Pf4Dc+wJk5g9K5s6KiIjs6Q8mCs/vXUq0c15k5q0j1m3JzH2FtuMI4IjMvKHHjEfSzJr3N50/d22FOiKemJmfGlpWn3kRcS/gmMz8xorlj8jMlTs8zKsvr/RVyM2bsqwaTnhZ6S0DzeolLyKeDXwNeH9EfGXFeOfbzas+r/RVyM2bxqxJvKOOiHNXWwWckplHTGPWhPIuA56Smd+JiMfSHOb1qsz8YER8ITMfbV7VeVfQ7Me4acXy+wAXZ+aDzaszr2TWpOaj/i2aY5r3r1geNNdWm9asSeQdnJnfAcjMSyJiJ/CRiLg//RylYF63Sl+F3LwpzJpUob4I+ElmXrByRftfalqzJpH3o4g4cWk8tX0nuAP4EPDPzas+r/RVyM2bwqzqdiZqfdo9zT/OzKtWLD8UeHZmvtu8evPa5y59FXLzpizLQi1JlavxqA9J0jIWakmqnIVaqkw7R4x5U5jXV9ak5qN+aDRzJp8XESdGxNsj4qaIuCQi/tm0Zpln3gbyTlpx2w6cGxGP7uOP3rzpzOp88pBxbsCFNLNoPZdm7ubfoznu8OnAp6c1yzzzNpB3B/D/gPOX3W5tv37GvHrzimZ13
VFjNnD5HNFXrVh36bRmmWfeBvL+JXABzdmQS8u+2XWOedOdNakx6uXzQL9uxbqu56gtmWWeeeuSme+nufrIk6K5PNzx9HMGpHlTnNXLf7Ux/hO9mNFzxj6Q5lI9U5llnnl3M/vRNB+bb+gzx7zpy/KEF6kiERHAkZl5i3nTlddn1sQOz4uyVyYulmWeeXcnLxu3mDcdecWySnwEGfEx4S9o9q6/HvgGyy7hRPc7E4tlmWfeBvL+3LzpzCua1XVHjdnAYlcmLpllnnnmzU5eyaxJDX2UvDJx6asgm2eeebORVyxrUoX6GxHx20t3MvPnmfkimquFd332V8ks88wzb3byimVN6lJcxa5MXDLLPPPMm528klkTeUedmbeubFxEvLpd1+kLVzLLPPPMm528klk1zZ73jIFmmWeeebOT10tWTYW6jwtd1pBlnnnmzU5eL1nVnJkYEQdl5h1DyzLPPPNmJ6+vrJreUX9toFnmmWfe7OT1kjWpoz5+xJ2zTC19VDgc+AmQmXmvacwyzzzzZievZNak3lG/DfgQ8KDMPDIzjwS+1X7f6QtXOMs888ybnbxyWdnxKZzj3oDtwGeAl9P8w7h6CFnmmWfe7OSVyprYGHVm7gF+p717AXDPIWSZZ555s5NXKquKoz4i4n7AozPzo0PKMs8882Ynr8+sQ7p+wnFFxEOBZwJb2kX7IuKbmXn5NGeZZ555s5NXKmsiQx8RcSawQLOn9JL2FsB7IuKsac0yzzzzZievaNv6HNQ/wAD814FDRyy/B3DltGaZZ555s5NXMmtSOxPvAH5txPL7teumNcs888ybnbxiWZMao/4j4NMRcSXw7XbZ8TRXen7pFGeZZ555s5NXLGtiR31ExEHAY1k2CA98PjN/Ps1Z5pln3uzkFcvqeozoboz37BpilnnmmTc7eX1l1TQp0x8ONMs888ybnbxesmoq1FM/Z6x55pk383mDn4/6uMy8bmhZ5pln3uzk9ZU1sXfUEfHQiHhCRGwCWGpcRJw6zVnmmWfe7OQVyyo5qL9swP3lNJdU/xBwDfDMZesundYs88wzb3byimZ13VFjNnAvsKn9fiuwCJzR3v/CtGaZZ555s5NXMmtSJ7wclJn7ATLzmojYAbwvIh5A94PxJbPMM8+82ckrljWpMerrI+JRS3faxj4NOBrYNsVZ5pln3uzkFcua1DUTjwNuz8zvjlh3cmb+4zRmmWeeebOTVzRrEoVakjS+Sc1HvS0iLoqIb0fEORFxn2XrLpnWLPPMM2928kpmTWqM+k3Aq2nGcb4OfDYiTmzXHTrFWeaZZ97s5JXL6vrwmDEPa/niivs7gSuBx9P9sY7Fsswzz7zZySua1XVHjdtAYPOKZY9oG/mDac0yzzzzZievaFbXHTVmA/818PgRy48H/m5as8wzz7zZySuZ5VEfklS5SR31sTkiXhMRX4uIH0bEDyLi8nbZvac1yzzzzJudvJJZkzrq473AjcCOzDwqM+9LMxB/Y7tuWrPMM8+82ckrl9X1GNGYYztXbGRd7VnmmWfe7OSVzJrUO+prI+KVETG3tCAi5iLiTO68mu80Zplnnnmzk1csa1KF+jnAfYEL2rGdHwK7gaOAZ09xlnnmmTc7ecWyPOpDkipXw6W4jlixvM/L5fSeZZ555s1OXrGsrgfzxxyEH+Tlcswzz7zZySua1XVHjdnAQV4uxzzzzJudvJJZXoqre+aZZ95s5HkprinNMs8882Ynz0txTWOWeeaZNzt5RbMmUaglSeOb2OF5kqTxWKglqXIWakmqnIVakir3/wHhmwVn4PFWWAAAAABJRU5ErkJggg==\n" }, "metadata": { "needs_background": "light" } } ], "source": [ "temp_series.plot(kind=\"bar\")\n", "\n", "plt.grid(True)\n", "plt.show()" ] }, { "cell_type": 
"markdown", "metadata": { "id": "mh0jD10ZbdkU" }, "source": [ "#### Periods\n", "The `pd.period_range()` function returns a `PeriodIndex` instead of a `DatetimeIndex`. For example, let's get all quarters in 2021 and 2022:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "MuvcVUi7bdkU", "outputId": "9bed4376-f3ce-4d25-d04c-1f7a9c2c1878", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "PeriodIndex(['2021Q1', '2021Q2', '2021Q3', '2021Q4', '2022Q1', '2022Q2',\n", " '2022Q3', '2022Q4'],\n", " dtype='period[Q-DEC]')" ] }, "metadata": {}, "execution_count": 19 } ], "source": [ "quarters = pd.period_range('2021Q1', periods=8, freq='Q')\n", "quarters" ] }, { "cell_type": "markdown", "metadata": { "id": "s0U-6hwvbdkU" }, "source": [ "Adding a number `N` to a `PeriodIndex` shifts the periods by `N` times the `PeriodIndex`'s frequency:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "msf61zGDbdkV", "outputId": "110697a5-be13-45f3-82ea-ad078c52e07c", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "PeriodIndex(['2021Q4', '2022Q1', '2022Q2', '2022Q3', '2022Q4', '2023Q1',\n", " '2023Q2', '2023Q3'],\n", " dtype='period[Q-DEC]')" ] }, "metadata": {}, "execution_count": 20 } ], "source": [ "quarters + 3" ] }, { "cell_type": "markdown", "metadata": { "id": "svb-1SlLbdkW" }, "source": [ "Pandas also provides many other time-related functions that we recommend you check out in the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)." ] }, { "cell_type": "markdown", "metadata": { "id": "_3B8znu4bdkX" }, "source": [ "### `DataFrame` objects\n", "A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. You can define expressions to compute columns based on other columns, create pivot-tables, group rows, draw graphs, etc. 
You can see `DataFrame`s as dictionaries of `Series`." ] }, { "cell_type": "markdown", "source": [ "\n", "#### Creating a `DataFrame`\n", "You can create a DataFrame by passing a dictionary of `Series` objects:" ], "metadata": { "id": "ge2vV_g9nKT3" } }, { "cell_type": "code", "execution_count": 69, "metadata": { "id": "YYuXxk5IbdkX", "outputId": "00fad6dd-7685-45ea-81db-71e635dd284f", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight birthyear children hobby\n", "alice 68 1985 NaN Biking\n", "bob 83 1984 3.0 Dancing\n", "charles 112 1992 0.0 NaN" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightbirthyearchildrenhobby
alice681985NaNBiking
bob8319843.0Dancing
charles11219920.0NaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 69 } ], "source": [ "people_dict = {\n", " \"weight\": pd.Series([68, 83, 112], index=[\"alice\", \"bob\", \"charles\"]),\n", " \"birthyear\": pd.Series([1984, 1985, 1992], index=[\"bob\", \"alice\", \"charles\"], name=\"year\"),\n", " \"children\": pd.Series([0, 3], index=[\"charles\", \"bob\"]),\n", " \"hobby\": pd.Series([\"Biking\", \"Dancing\"], index=[\"alice\", \"bob\"]),\n", "}\n", "people = pd.DataFrame(people_dict)\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "q5PqT_JjbdkX" }, "source": [ "A few things to note:\n", "* the `Series` were automatically aligned based on their index,\n", "* missing values are represented as `NaN`,\n", "* `Series` names are ignored (the name `\"year\"` was dropped),\n", "* `DataFrame`s are displayed nicely in Jupyter notebooks!" ] }, { "cell_type": "markdown", "metadata": { "id": "Z7-ImAbLbdkX" }, "source": [ "You can access columns pretty much as you would expect. They are returned as `Series` objects:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "id": "Wzc_2C8fbdkX", "outputId": "4cc1bd71-c1e5-4456-b996-912beaae64cd", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice 1985\n", "bob 1984\n", "charles 1992\n", "Name: birthyear, dtype: int64" ] }, "metadata": {}, "execution_count": 70 } ], "source": [ "people[\"birthyear\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "EorAvWrRbdkX" }, "source": [ "You can also get multiple columns at once:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "id": "1MlqI9oLbdkX", "outputId": "011f6209-92cc-4680-fb59-714a0ee4e4cf", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " birthyear hobby\n", "alice 1985 Biking\n", "bob 1984 Dancing\n", "charles 1992 NaN" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthyearhobby
alice1985Biking
bob1984Dancing
charles1992NaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 71 } ], "source": [ "people[[\"birthyear\", \"hobby\"]]" ] }, { "cell_type": "markdown", "metadata": { "id": "BR_u72S8bdkY" }, "source": [ "Another convenient way to create a `DataFrame` is to pass all the values to the constructor as an `ndarray`, or a list of lists, and specify the column names and row index labels separately:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "id": "7W8qDSb0bdkY", "outputId": "379196e3-c6aa-4ad8-8eee-f9a047905e9f", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " birthyear children hobby weight\n", "alice 1985 NaN Biking 68\n", "bob 1984 3.0 Dancing 83\n", "charles 1992 0.0 NaN 112" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthyearchildrenhobbyweight
alice1985NaNBiking68
bob19843.0Dancing83
charles19920.0NaN112
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 72 } ], "source": [ "values = [\n", " [1985, np.nan, \"Biking\", 68],\n", " [1984, 3, \"Dancing\", 83],\n", " [1992, 0, np.nan, 112]\n", " ]\n", "d3 = pd.DataFrame(\n", " values,\n", " columns=[\"birthyear\", \"children\", \"hobby\", \"weight\"],\n", " index=[\"alice\", \"bob\", \"charles\"]\n", " )\n", "d3" ] }, { "cell_type": "markdown", "metadata": { "id": "kV6Iyx6KbdkY" }, "source": [ "To specify missing values, you can either use `np.nan` or NumPy's masked arrays:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "id": "7eq0bxHWbdkY", "outputId": "baa21e0a-d0d9-4c4b-e789-ff145155aa2b", "colab": { "base_uri": "https://localhost:8080/", "height": 216 } }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. \n", "Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " birthyear children hobby weight\n", "alice 1985 NaN Biking 68\n", "bob 1984 3 Dancing 83\n", "charles 1992 0 NaN 112" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthyearchildrenhobbyweight
alice1985NaNBiking68
bob19843Dancing83
charles19920NaN112
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 73 } ], "source": [ "masked_array = np.ma.asarray(values, dtype=object)\n", "masked_array[(0, 2), (1, 2)] = np.ma.masked\n", "d3 = pd.DataFrame(\n", " masked_array,\n", " columns=[\"birthyear\", \"children\", \"hobby\", \"weight\"],\n", " index=[\"alice\", \"bob\", \"charles\"]\n", " )\n", "d3" ] }, { "cell_type": "markdown", "source": [ "You can also create a multi-index `DataFrame` as follows:" ], "metadata": { "id": "1hA9uEAiqUlg" } }, { "cell_type": "code", "source": [ "df = pd.DataFrame(\n", " {\"a\" : [4, 5, 6],\n", " \"b\" : [7, 8, 9],\n", " \"c\" : [10, 11, 12]},\n", "index = pd.MultiIndex.from_tuples(\n", " [('d',1),('d',2),('e',2)], names=['n','v'])\n", ")\n", "df" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "xsBbZelxn8tC", "outputId": "951bab2e-0546-4d81-e33c-ec1cc995a215" }, "execution_count": 74, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " a b c\n", "n v \n", "d 1 4 7 10\n", " 2 5 8 11\n", "e 2 6 9 12" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
nv
d14710
25811
e26912
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 74 } ] }, { "cell_type": "markdown", "source": [ "If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:" ], "metadata": { "id": "ZJG34ozirdgh" } }, { "cell_type": "code", "source": [ "d5 = pd.DataFrame(\n", " {\n", " (\"public\", \"birthyear\"):\n", " {(\"Paris\",\"alice\"):1985, (\"Paris\",\"bob\"): 1984, (\"London\",\"charles\"): 1992},\n", " (\"public\", \"hobby\"):\n", " {(\"Paris\",\"alice\"):\"Biking\", (\"Paris\",\"bob\"): \"Dancing\"},\n", " (\"private\", \"weight\"):\n", " {(\"Paris\",\"alice\"):68, (\"Paris\",\"bob\"): 83, (\"London\",\"charles\"): 112},\n", " (\"private\", \"children\"):\n", " {(\"Paris\", \"alice\"):np.nan, (\"Paris\",\"bob\"): 3, (\"London\",\"charles\"): 0}\n", " }\n", ")\n", "d5" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "sge2O2qvrDop", "outputId": "a1d196e1-05d4-4676-f82b-8a172fbaf331" }, "execution_count": 75, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " public private \n", " birthyear hobby weight children\n", "Paris alice 1985 Biking 68 NaN\n", " bob 1984 Dancing 83 3.0\n", "London charles 1992 NaN 112 0.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
publicprivate
birthyearhobbyweightchildren
Parisalice1985Biking68NaN
bob1984Dancing833.0
Londoncharles1992NaN1120.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 75 } ] }, { "cell_type": "markdown", "source": [ "You can now get a DataFrame containing all the \"public\" columns very simply:" ], "metadata": { "id": "EeN5brxLrmHy" } }, { "cell_type": "code", "source": [ "d5[\"public\"]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "m4yGqoUOrlnZ", "outputId": "98c6824f-bae8-40a0-f790-0188ad4144cb" }, "execution_count": 76, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " birthyear hobby\n", "Paris alice 1985 Biking\n", " bob 1984 Dancing\n", "London charles 1992 NaN" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
birthyearhobby
Parisalice1985Biking
bob1984Dancing
Londoncharles1992NaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 76 } ] }, { "cell_type": "markdown", "metadata": { "id": "IT41nGg7bdkc" }, "source": [ "Note that most pandas methods return a modified copy instead of changing the object in place." ] }, { "cell_type": "markdown", "metadata": { "id": "HHUt1_d1bdkc" }, "source": [ "#### Subsets - Accessing rows\n", "Let's go back to the `people` `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "id": "75UxPZPAbdkc", "outputId": "a04a8e89-6897-437b-fa44-8339b8312c3d", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight birthyear children hobby\n", "alice 68 1985 NaN Biking\n", "bob 83 1984 3.0 Dancing\n", "charles 112 1992 0.0 NaN" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightbirthyearchildrenhobby
alice681985NaNBiking
bob8319843.0Dancing
charles11219920.0NaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 77 } ], "source": [ "people" ] }, { "cell_type": "markdown", "metadata": { "id": "86z3de-Ibdkc" }, "source": [ "**The `loc` attribute lets you access rows instead of columns.** The result is a `Series` object in which the `DataFrame`'s column names are mapped to row index labels:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "id": "2ii7IFnIbdkc", "outputId": "ff1e27cb-8f0a-4287-aefc-043155dbe148", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "weight 112\n", "birthyear 1992\n", "children 0.0\n", "hobby NaN\n", "Name: charles, dtype: object" ] }, "metadata": {}, "execution_count": 78 } ], "source": [ "people.loc[\"charles\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "M9n8cJsDbdkd" }, "source": [ "You can also access rows by integer location using the `iloc` attribute:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "id": "u2T-r9f2bdkd", "outputId": "97197019-38cd-47ae-da80-ee5aedbd6fca", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "weight 112\n", "birthyear 1992\n", "children 0.0\n", "hobby NaN\n", "Name: charles, dtype: object" ] }, "metadata": {}, "execution_count": 79 } ], "source": [ "people.iloc[2]" ] }, { "cell_type": "markdown", "metadata": { "id": "h9wxoGC2bdkd" }, "source": [ "You can also get a slice of rows, and this returns a `DataFrame` object:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "id": "PUAGAfWmbdkd", "outputId": "aa7df8c9-05f4-4c0f-ce2b-d5a079004993", "colab": { "base_uri": "https://localhost:8080/", "height": 112 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight birthyear children hobby\n", "bob 83 1984 3.0 Dancing\n", "charles 112 1992 0.0 NaN" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightbirthyearchildrenhobby
bob8319843.0Dancing
charles11219920.0NaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 80 } ], "source": [ "people.iloc[1:3]" ] }, { "cell_type": "markdown", "metadata": { "id": "agVv1ZEKbdkd" }, "source": [ "Finally, you can pass a boolean array to get the matching rows. This is most useful when combined with boolean expressions:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "id": "Uh00Rgg1bdkd", "outputId": "0b5abc4e-0f18-408f-bdc5-3bbbdbc7534c", "colab": { "base_uri": "https://localhost:8080/", "height": 112 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight birthyear children hobby\n", "alice 68 1985 NaN Biking\n", "bob 83 1984 3.0 Dancing" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightbirthyearchildrenhobby
alice681985NaNBiking
bob8319843.0Dancing
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 81 } ], "source": [ "people[people[\"birthyear\"] < 1990]" ] }, { "cell_type": "markdown", "source": [ "You can also access columns by specifying the second axis:" ], "metadata": { "id": "KVvkXiAksSd6" } }, { "cell_type": "code", "source": [ "people.iloc[:,2]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "i4uSRaCfsaGp", "outputId": "fd5f66c2-6341-4b3d-993d-3ae570415c39" }, "execution_count": 82, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice NaN\n", "bob 3.0\n", "charles 0.0\n", "Name: children, dtype: float64" ] }, "metadata": {}, "execution_count": 82 } ] }, { "cell_type": "markdown", "metadata": { "id": "KuXwobq6bdke" }, "source": [ "#### Adding and removing columns\n", "You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following work fine:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "id": "9n9f6-N_bdke", "outputId": "000fb941-5b79-4b15-dfd4-303559a1b9d0", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight birthyear children hobby\n", "alice 68 1985 NaN Biking\n", "bob 83 1984 3.0 Dancing\n", "charles 112 1992 0.0 NaN" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightbirthyearchildrenhobby
alice681985NaNBiking
bob8319843.0Dancing
charles11219920.0NaN
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 83 } ], "source": [ "people" ] }, { "cell_type": "code", "source": [ "people[\"age\"] = 2018 - people[\"birthyear\"] # adds a new column \"age\"\n", "people[\"over 30\"] = people[\"age\"] > 30 # adds another column \"over 30\"\n", "birthyears = people.pop(\"birthyear\")\n", "people.drop(columns=['children'], inplace=True) # drop a column inplace\n", "people" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "Y-qN7BgDtlh9", "outputId": "9bf41a39-00b6-49d1-fc11-cb1ccbb788eb" }, "execution_count": 90, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight hobby age over 30\n", "alice 68 Biking 33 True\n", "bob 83 Dancing 34 True\n", "charles 112 NaN 26 False" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weighthobbyageover 30
alice68Biking33True
bob83Dancing34True
charles112NaN26False
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 90 } ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "id": "CmdxWxvqbdke", "outputId": "fc05e249-cc43-4d9c-f594-3ce9a7d73752", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice 1985\n", "bob 1984\n", "charles 1992\n", "Name: birthyear, dtype: int64" ] }, "metadata": {}, "execution_count": 91 } ], "source": [ "birthyears" ] }, { "cell_type": "markdown", "metadata": { "id": "1_VwIo7cbdke" }, "source": [ "When you add a new column from a `Series`, it is aligned on the `DataFrame`'s index labels. Missing rows are filled with NaN, and extra rows are ignored:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "id": "4xPE-3XQbdke", "outputId": "c06cbfba-1c16-43f2-d6d1-d05ec66827e2", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight hobby age over 30 pets\n", "alice 68 Biking 33 True NaN\n", "bob 83 Dancing 34 True 0.0\n", "charles 112 NaN 26 False 5.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weighthobbyageover 30pets
alice68Biking33TrueNaN
bob83Dancing34True0.0
charles112NaN26False5.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 92 } ], "source": [ "people[\"pets\"] = pd.Series({\"bob\": 0, \"charles\": 5, \"eugene\":1}) # alice is missing, eugene is ignored\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "qmOZF02Ebdkf" }, "source": [ "When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the `insert()` method:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "id": "GEIn8gb1bdkf", "outputId": "dc73568e-b163-4831-ff8e-dec3cdf9eec3", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets\n", "alice 68 172 Biking 33 True NaN\n", "bob 83 181 Dancing 34 True 0.0\n", "charles 112 185 NaN 26 False 5.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30pets
alice68172Biking33TrueNaN
bob83181Dancing34True0.0
charles112185NaN26False5.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 93 } ], "source": [ "people.insert(1, \"height\", [172, 181, 185])\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "YAQ4_fffbdkf" }, "source": [ "You can also create new columns by calling the `assign()` method. Note that this returns a new `DataFrame` object; the original is not modified:" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "id": "qEDE3MLTbdkf", "outputId": "46f069d4-6c3d-4906-e7ef-b164b635d223", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets bmi has_pets\n", "alice 68 172 Biking 33 True NaN 22.985398 False\n", "bob 83 181 Dancing 34 True 0.0 25.335002 False\n", "charles 112 185 NaN 26 False 5.0 32.724617 True" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30petsbmihas_pets
alice68172Biking33TrueNaN22.985398False
bob83181Dancing34True0.025.335002False
charles112185NaN26False5.032.724617True
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 103 } ], "source": [ "p2 = people.assign(\n", " bmi = people[\"weight\"] / (people[\"height\"] / 100) ** 2,\n", " has_pets = people[\"pets\"] > 0\n", ")\n", "p2" ] }, { "cell_type": "markdown", "source": [ "You can also rename a column with the `rename()` method, which returns a new `DataFrame` by default:" ], "metadata": { "id": "li0DlR9hvHde" } }, { "cell_type": "code", "source": [ "p2.rename(columns={'bmi':'body_mass_index'})" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "WuV0yEpEuJD9", "outputId": "a0246610-cc9b-4441-e5ea-392b49b9035a" }, "execution_count": 104, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets body_mass_index \\\n", "alice 68 172 Biking 33 True NaN 22.985398 \n", "bob 83 181 Dancing 34 True 0.0 25.335002 \n", "charles 112 185 NaN 26 False 5.0 32.724617 \n", "\n", " has_pets \n", "alice False \n", "bob False \n", "charles True " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30petsbody_mass_indexhas_pets
alice68172Biking33TrueNaN22.985398False
bob83181Dancing34True0.025.335002False
charles112185NaN26False5.032.724617True
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 104 } ] }, { "cell_type": "markdown", "metadata": { "id": "I2Oe3W7zbdkg" }, "source": [ "#### Evaluating an expression\n", "A great feature supported by pandas is expression evaluation. By default this uses the `numexpr` library when it is installed:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "id": "_bo42tAMbdkg", "outputId": "888705bc-1db3-4c48-ab9e-4ebbc7c8de81", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice False\n", "bob True\n", "charles True\n", "dtype: bool" ] }, "metadata": {}, "execution_count": 105 } ], "source": [ "people.eval(\"weight / (height/100) ** 2 > 25\")" ] }, { "cell_type": "markdown", "metadata": { "id": "BHvHS0IQbdkh" }, "source": [ "Assignment expressions are also supported. Let's set `inplace=True` to directly modify the `DataFrame` rather than getting a modified copy:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "id": "6Jo-4YJNbdkh", "outputId": "98a48e15-2675-43af-da81-3a7a40399317", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets body_mass_index\n", "alice 68 172 Biking 33 True NaN 22.985398\n", "bob 83 181 Dancing 34 True 0.0 25.335002\n", "charles 112 185 NaN 26 False 5.0 32.724617" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30petsbody_mass_index
alice68172Biking33TrueNaN22.985398
bob83181Dancing34True0.025.335002
charles112185NaN26False5.032.724617
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 106 } ], "source": [ "people.eval(\"body_mass_index = weight / (height/100) ** 2\", inplace=True)\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "0EhB5ch3bdkh" }, "source": [ "You can use a local or global variable in an expression by prefixing it with `'@'`:" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "id": "Df6YIkMRbdkh", "outputId": "8e2ff479-7929-43fc-f21d-4140bb5608da", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets body_mass_index \\\n", "alice 68 172 Biking 33 True NaN 22.985398 \n", "bob 83 181 Dancing 34 True 0.0 25.335002 \n", "charles 112 185 NaN 26 False 5.0 32.724617 \n", "\n", " overweight \n", "alice False \n", "bob False \n", "charles True " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30petsbody_mass_indexoverweight
alice68172Biking33TrueNaN22.985398False
bob83181Dancing34True0.025.335002False
charles112185NaN26False5.032.724617True
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 107 } ], "source": [ "overweight_threshold = 30\n", "people.eval(\"overweight = body_mass_index > @overweight_threshold\", inplace=True)\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "8_UK9SF5bdkh" }, "source": [ "#### Querying a `DataFrame`\n", "The `query()` method lets you **filter a `DataFrame` based on a query expression**:" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "id": "hiVf_7cJbdkh", "outputId": "59659ea4-0a63-470a-e5db-4081ccfbbd01", "colab": { "base_uri": "https://localhost:8080/", "height": 81 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets body_mass_index overweight\n", "bob 83 181 Dancing 34 True 0.0 25.335002 False" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30petsbody_mass_indexoverweight
bob83181Dancing34True0.025.335002False
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 108 } ], "source": [ "people.query(\"age > 30 and pets == 0\")" ] }, { "cell_type": "markdown", "metadata": { "id": "zQYCm9izbdkh" }, "source": [ "#### Sorting a `DataFrame`\n", "You can sort a `DataFrame` by calling its `sort_index` method. By default it sorts the rows by their index label, in ascending order, but let's reverse the order:" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "id": "vTUlVF6ibdki", "outputId": "1e746bfe-56f2-400f-e72b-4c11fcfbcf63", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " weight height hobby age over 30 pets body_mass_index \\\n", "charles 112 185 NaN 26 False 5.0 32.724617 \n", "bob 83 181 Dancing 34 True 0.0 25.335002 \n", "alice 68 172 Biking 33 True NaN 22.985398 \n", "\n", " overweight \n", "charles True \n", "bob False \n", "alice False " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheighthobbyageover 30petsbody_mass_indexoverweight
charles112185NaN26False5.032.724617True
bob83181Dancing34True0.025.335002False
alice68172Biking33TrueNaN22.985398False
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 109 } ], "source": [ "people.sort_index(ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "hyUelvGXbdki" }, "source": [ "Note that `sort_index` returned a sorted *copy* of the `DataFrame`. To modify `people` directly, we can set the `inplace` argument to `True`. Also, we can sort the columns instead of the rows by setting `axis=1`:" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "id": "GTIgiGLTbdki", "outputId": "fea737f4-6f23-4df1-a7c0-2dac5842d163", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " age body_mass_index height hobby over 30 overweight pets \\\n", "alice 33 22.985398 172 Biking True False NaN \n", "bob 34 25.335002 181 Dancing True False 0.0 \n", "charles 26 32.724617 185 NaN False True 5.0 \n", "\n", " weight \n", "alice 68 \n", "bob 83 \n", "charles 112 " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agebody_mass_indexheighthobbyover 30overweightpetsweight
alice3322.985398172BikingTrueFalseNaN68
bob3425.335002181DancingTrueFalse0.083
charles2632.724617185NaNFalseTrue5.0112
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 110 } ], "source": [ "people.sort_index(axis=1, inplace=True)\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "2-4p9J5Dbdki" }, "source": [ "To sort the `DataFrame` by the values instead of the labels, we can use `sort_values` and specify the column to sort by:" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "id": "JHukwfIIbdki", "outputId": "57a23819-651b-49b6-b646-a6580bc37627", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " age body_mass_index height hobby over 30 overweight pets \\\n", "charles 26 32.724617 185 NaN False True 5.0 \n", "alice 33 22.985398 172 Biking True False NaN \n", "bob 34 25.335002 181 Dancing True False 0.0 \n", "\n", " weight \n", "charles 112 \n", "alice 68 \n", "bob 83 " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agebody_mass_indexheighthobbyover 30overweightpetsweight
charles2632.724617185NaNFalseTrue5.0112
alice3322.985398172BikingTrueFalseNaN68
bob3425.335002181DancingTrueFalse0.083
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 111 } ], "source": [ "people.sort_values(by=\"age\", inplace=True)\n", "people" ] }, { "cell_type": "markdown", "metadata": { "id": "OBVCkgSGbdki" }, "source": [ "#### Plotting a `DataFrame`\n", "Just like for `Series`, pandas makes it easy to draw nice graphs based on a `DataFrame`.\n", "\n", "For example, it is trivial to create a line plot from a `DataFrame`'s data by calling its `plot` method:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "id": "7gv0BV_ebdki", "outputId": "15c3bd59-5925-4811-f428-6320e47430b2", "colab": { "base_uri": "https://localhost:8080/", "height": 280 } }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEHCAYAAABV4gY/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deXzV1Z3/8dcHshGysQSBLASVRZaAEJFFXKtVx4p23DsWV6bK1FGn07F2HtVpbYtK7cMZH1N/WpA6Y3VUbHW0zrSOWhUqGJQdF1SWsG8JS8h+fn+cb3IvMSF7bvLN+/l45MG95/u9954vyjvnnu9ZzDmHiIiES69YV0BERNqfwl1EJIQU7iIiIaRwFxEJIYW7iEgIxcW6AgADBw50eXl5sa6GiEi3smLFir3OucyGjnWJcM/Ly6OwsDDW1RAR6VbMbHNjx9QtIyISQgp3EZEQUriLiISQwl1EJIQU7iIiIaRwFxEJIYW7iEgIdYlx7iIiYVdWWc2B0gqKSys5UFpBSWklB0orGTU4lcnD+rX75yncRURaoLyqui6Yi0srOFBaScnRiuB58PhIJcVHfZDXhnl5VU2D7zfnzBMV7iIi7aWyuqYujIujwrq41AfzgdLKIMRrQ7qC4qOVlFZUN/qe8b2NjOQE+iXHk9Engdz+yeRnx5ORnEBGUNYvOZ705Hj6BWX9khM65PoU7iLSrVXXOEqO1mtFH6mk+GgkrA+UVlByNDqoKzlcXtXoe/buZWT0ifeBnJzA0IwkThmS5kM7ORLW/ZITSO8TT7++CWT0iSc5oTdm1olX3ziFu4h0CTU1jkNlVT6Aj0b3Sx/bcvYt6oq6lvbBssZDupdBep9IGGemJDJyUOoxLeeMZB/Mtc/Tk+NJTYzrMiHdWgp3EWlXzjkOlVcdE8y1LedjbyZWBK1rH9IlRyupOc6WzmlJcXUt5IzkBPIG9q1rOde1omtDOwjr1KQ4evXq3iHdWgp3EWmQc47SimrfWj5Sv1sj0k8duZlY219dSfVxUjolMS5oMfsAzsro89VWdN940oP+6YwgwHv30JBurSbD3cwWApcAu51z44KyicDjQBJQBdzunFtu/nvMo8DFQClwg3Puw46qvDSusrqGT3YeYnVRCauLipmYk8E1U3JjXS2JkcaG4UVGdNS7gXjUP66obniEB0ByQu+6VnRGcjyjB6cFLWd/47A2rKP7qdP7xBPfW9NrOkNzWu6LgMeAp6PKHgL+xTn3upldHDw/G7gIGBH8nA78KvhTOlBNjeOLvUdYXVTM6qISVhUVs377wbqhV2lJcQxKTYxxLaU9tPcwPIDEuF5RLed4TspMqWs5ZwRhHd2Krh3tkRjXuxOvXFqqyXB3zr1jZnn1i4G04HE6sD14PAt42jnngPfNLMPMhjjndrRTfXs85xzbio/WhfjqrSWs3VbCoeDOf5/43ozLSuNvpg4jPzudCdkZDBuQ3O1vDoVNRw7Dq+1vzmnmMLykeIV0GLW2z/1O4H/NbD5+CYPpQXkWsDXqvKKg7CvhbmZzgDkAubnqLmjM3sPlrC4qZtVW372yZlsJew9XAP4f8+jBaVw6cSgTsjPIz0nn5MwU4vS1t9NoGJ50Va0N99uAu5xzi83sKmAB8LWWvIFz7gngCYCCgoLj3CPvOQ6VVbJmW0ldP/mqrSVsKz4KgBmcnJnCWSMHMSEnnfzsDEYPTlWrq500dxjeMUPyjrR9GF56n2Nb0WEZhiex19pwnw38ffD4BeDXweNtQE7UedlBmdRTVlnN+h0HWb010k/+xd4juODXXE7/PkzMzWD29GHkZ2cwLiudlEQNbmpKQ8PwvtKKbuUwvEh/sx+GF30zUcPwpKtpbVpsB84C3gbOBT4Lyl8B/s7MnsPfSC1RfztUVdfw6a7DvjUetMo/2XmIqiBNBqYkMiE7nUsnZNW1yvv37Zgpyd2FhuGJtE1zh
kI+ix8JM9DMioD7gFuBR80sDigj6DsH/oAfBrkRPxTyxg6oc5dWU+PYtO9I5IZnUQnrtpdQVhkZuZKfncGtZ57IhGwf5EPSk0L9Nbwlw/Cibya2ZRjesTcNIyM/NAxPeormjJa5tpFDkxs41wFz21qp7sI5x46SsmNa5KuLSjgU9MMmxfdi7NB0rp2Sy4TsDCbkZDCsf3K3/are2DC8Y1rRbRiGl94nnhMHptS7afjVYXhpfeJ1r0GkCerEbYH9Ryrqhh/WBvrew+UAxPUyRg1O5ZL8oXUt8pEndM2RK5XVNXUjPDQMTyScFO6NOFxexZqo1viqomKKDkRGrpw4sC9njhhIfnY6+TkZjBmS1ulBFctheJEZiD6kNQxPpGtRuOP7hDfsOHhMP/nnew7XjVzJyujDhJz0uolB47PSSU2Kb7fPb2oY3lduJjZjGJ4ZdX3S6X38MLwRg1Ijrei+GoYnEmY9Ltyrqmv4bPdXR65UVteOXEkgPzuDS/KH+IlB2ekMSGne1P3oYXi1rWYNwxORWAh1uDvn2LSv9JgZnuu2H+Rope87Tk2MY3x2OjefEYxcyclgaHoSQN0wvB0lZXy881DMhuGlJcV1yX57EenaQhXuO0vKWLm1uK6ffHVR8Ve6LganJTE+O53c/sn0S46n5Ggln+85zIebD7R6GN6owanH3EzUMDwRibVuHe7lVdU8tWQThZsOsLqomN2Hypt8zc6DZew8WMbyL/eTENeLflFrdGgYnoiERbcO9893H+GRP36Kw5GRnMCIQSlRLeeGh+HV3kzM6JNAnwSFtIiEU7cO9zFD01j3468T18s0wkNEJEq3DndA/dgiIg1QMoqIhJDCXUQkhBTuIiIhpHAXEQkhhbuISAgp3EVEQkjhLiISQgp3EZEQUriLiISQwl1EJIQU7iIiIaRwFxEJIYW7iEgIKdxFREKoyXA3s4VmttvM1tYr/66ZfWxm68zsoajyH5jZRjP7xMy+3hGVFhGR42vOeu6LgMeAp2sLzOwcYBYwwTlXbmaDgvIxwDXAWGAo8IaZjXTOVbd3xUVEpHFNttydc+8A++sV3wbMc86VB+fsDspnAc8558qdc18CG4Ep7VhfERFphtb2uY8EZprZMjP7s5mdFpRnAVujzisKykREpBO1dpu9OKA/MBU4DXjezE5syRuY2RxgDkBubm4rqyEiIg1pbcu9CHjJecuBGmAgsA3IiTovOyj7CufcE865AudcQWZmZiurISIiDWltuP8eOAfAzEYCCcBe4BXgGjNLNLPhwAhgeXtUVEREmq/JbhkzexY4GxhoZkXAfcBCYGEwPLICmO2cc8A6M3seWA9UAXM1UkZEpPOZz+TYKigocIWFhbGuhohIt2JmK5xzBQ0d0wxVEZEQUriLiISQwl1EJIQU7iIiIaRwFxEJIYW7iEgIKdxFREJI4S4iEkIKdxGREFK4i4iEkMJdRCSEFO4iIiGkcBcRCSGFu4hICCncRURCSOEuIhJCCncRkRBSuIuIhJDCXUQkhBTuIiIhpHAXEQkhhbuISAgp3EVEQkjhLiISQgp3EZEQajLczWyhme02s7UNHPsHM3NmNjB4bmb2r2a20cxWm9mkjqi0iIgcX3Na7ouAC+sXmlkOcAGwJar4ImBE8DMH+FXbqygiIi3VZLg7594B9jdw6JfA9wEXVTYLeNp57wMZZjakXWoqIiLN1qo+dzObBWxzzq2qdygL2Br1vCgoa+g95phZoZkV7tmzpzXVEBGRRrQ43M0sGbgX+FFbPtg594RzrsA5V5CZmdmWtxIRkXriWvGak4DhwCozA8gGPjSzKcA2ICfq3OygTEREOlGLW+7OuTXOuUHOuTznXB6+62WSc24n8Arw7WDUzFSgxDm3o32rLCIiTWnOUMhngb8Ao8ysyMxuPs7pfwC+ADYCTwK3t0stRUSkRZrslnHOXdvE8byoxw6Y2/ZqiYhIW2iGqohICCncRURCSOEuIhJCCncRk
RBSuIuIhJDCXUQkhBTuIiIhpHAXEQkhhbuISAgp3EVEQkjhLiISQgp3EZEQUriLiISQwl1EJIQU7iIiIaRwFxEJIYW7iEgIKdxFREJI4S4iEkIKdxGREFK4i4iEkMJdRCSEFO4iIiGkcBcRCSGFu4hICDUZ7ma20Mx2m9naqLKHzexjM1ttZr8zs4yoYz8ws41m9omZfb2jKi4iIo1rTst9EXBhvbI/AeOcc/nAp8APAMxsDHANMDZ4zb+bWe92q62IiDRLk+HunHsH2F+v7I/Ouarg6ftAdvB4FvCcc67cOfclsBGY0o71FRGRZmiPPvebgNeDx1nA1qhjRUHZV5jZHDMrNLPCPXv2tEM1RESkVpvC3cx+CFQBz7T0tc65J5xzBc65gszMzLZUQ0RE6olr7QvN7AbgEuA855wLircBOVGnZQdlIiLSiVrVcjezC4HvA5c650qjDr0CXGNmiWY2HBgBLG97NUVEpCWabLmb2bPA2cBAMysC7sOPjkkE/mRmAO87577jnFtnZs8D6/HdNXOdc9UdVXkREWmYRXpUYqegoMAVFhbGuhoiIt2Kma1wzhU0dEwzVEVEQkjhLiISQgp3EZEQUriLiISQwl1EJIQU7iIiIaRwFxGJlfJDULq/6fNaodXLD4iISAsdPQBb3odN78HmpbBjFcy8G87953b/KIW7iEhHObwHtiyFTUt8mO9aCzjonQDZp/lgH3VRh3y0wl1EpL0c3O5DvLZlvvcTXx7XB3KmwDn3wrDpkFUA8UkdWhWFu4hIazgHxZuDMF8Cm5fAgS/9sYRUyJ0KE6+FYTNgyESIS+jU6incRUSawznY9zlsfi8S6AeL/LE+/SB3Oky51bfMB+dDr9juMKpwFxFpSE0N7PnYt8g3B33mh3f5Y30H+RDPu9P/mXkK9Opagw8V7iIiADXVsHNNJMg3L4WjwTDFtCwYfhbkzfDdLANOBr/ceZelcBeRnqm6EravjHSzbHkfyg/6Y/3yYNTFQet8BmQM6/JhXp/CXUR6hsoy2LYiaJW/B1uXQ2WwkdzAkTDur32rfNh0SM+KbV3bgcJdRMKp4ogP8M1LfVdLUSFUlwMGJ4yFU6/3QT5sBqRkxrq27U7hLiLhUFYCW5ZFboBu/whqqsB6wZAJwUiWGX6IYnL/WNe2wyncRaR7Kt0fufG5+T1/M9TVQK94yJoE0+/wYZ4zBZLSYl3bTqdwF5Hu4dCuqJEsS2D3el8el+Sn8p/5jz7Ms0+DhOTY1rULULiLSNdUUnTsVP59n/ny+L6QezqM+yYMO8O30uMSY1vXLkjhLiKx55yfuh89lb94sz+WmA7DpsGk632YD8mH3vGxrW83oHAXkc7nHOz91Id47YqJh7b7Y8kD/CiWqbf5bpYTxsZ8Kn93pHAXkY5XU+P7yDcviXSzlO71x1JO8CFeO/tz4KguN5W/O1K4i0j7q66CnauPncpfVuyPpefCyV+LhHn/E7vd7M/uoMlwN7OFwCXAbufcuKCsP/BfQB6wCbjKOXfAzAx4FLgYKAVucM592DFVF5Euo6rCjyuvm8q/DCoO+WP9T4JTvgF5Z/julozc2Na1h2hOy30R8BjwdFTZPcD/Oefmmdk9wfN/Ai4CRgQ/pwO/Cv4UkTCpPOpnfNZOGNr6AVQd9ccyT4H8q3zLPHc6pA2JbV17qCbD3Tn3jpnl1SueBZwdPP4N8DY+3GcBTzvnHPC+mWWY2RDn3I72qrCIxED5Ydi6LNLNsm0FVFcABoPHw+QbgjCfBn0Hxrq2Quv73E+ICuydwAnB4yxga9R5RUHZV8LdzOYAcwByc/U1TaRLOVrsV0ms7WbZvhJcNVhvGDoRTv+O72bJOR36ZMS6ttKANt9Qdc45M3OteN0TwBMABQUFLX69iLSjI/uibn6+BzujNnLOmgxn3OX7y3NOh8SUWNdWmqG14b6rtrvFzIYAu4PybUBO1HnZQZmIdCWHdkaGJG5e4nccgmAj59Pg7B/4M
M8ugPg+sa2rtEprw/0VYDYwL/jz5ajyvzOz5/A3UkvU3y7SBRRvicz83LwE9n/hyxNS/VT+/Kv9sMShp3b6Rs7SMZozFPJZ/M3TgWZWBNyHD/XnzexmYDNwVXD6H/DDIDfih0Le2AF1FpHjcc6Hd3TLvCS4FZaU4VvkBTdHNnLurekuYdSc0TLXNnLovAbOdcDctlZKRFrAOd+tUhfmS+HwTn+sb6YP8el3+D8HjdHszx5Cv7JFupuaati1NrJi4pa/QOk+fyx1KAyfGewwdAYMHKHZnz2Uwl2kq6uuhB2rIotsbXkfykv8sYxhMPLCyHZx/fIU5gIo3EW6nqryYCPnIMy3LofKI/7YgBEw7vKojZyzY1tX6bIU7iKxVlEKRR9Ewrzog2AjZ2DQWJh4XWSRrZRBsa2rdBsKd5HOVnbQt8ZrZ39u+xBqKv1GzoPz4bRbIlP5e8BGztIxFO4iHa10fzCVP1jLfOfqYCPnOBg6CabNDabyT4Gk9FjXVkJC4S7S3g7vjowv37wUdq3DT+VP9Js3z/yeb5lnnwYJfWNdWwkphbtIW5Vsi6zJsnmp3z4OID7Zr8Vyzg99mA+dBPFJsa2r9BgKd5GWcM5v3Bw9lf/AJn8sMQ1yp8LEb/luliETtJGzxIzCXeR4nIN9G4+dyn8wWAuvTz8/gmXK3wZT+cdrI2fpMhTuItFqamDPhqiW+VI4Eix62ndQZEjisBmQOVpT+aXLUrhLz1ZdBbvWBGG+FLYshaMH/LG0bDjpnEiYDzhJsz+l21C4S89SVQE7Vka6Wba8H7WR84kw+q/8mizDpkO/YbGtq0gbKNwl3CrLYFthpJul6AOoLPXHBo6C/CsjU/nThsa2riLtSOEu4VJxJNjIeakP9G2FkY2cTxgHk77tgzx3OqRkxrq2Ih1G4S7dW1lJ1OzPJb7LpabKb+Q8ZAKc/re+ZZ471Y9uEekhFO7SvRzZ52961q5lvmttMJU/3m/kPOPvozZyTo11bUViRuEuXduhXZGZn5uW+GGKAHFJfvr+md+PTOXXRs4idRTu0rUUbz12Kv++jb48IcW3xsdf4Wd/Dj0V4hJjW1eRLkzhLrFTu5Fz3SJbS6B4iz+WlO5vek6a7VvmgydoI2eRFtC/Fuk8zsGeTyJBvnkpHNrhjyUP9H3lU+f6MB80RlP5e6jKykqKioooKyuLdVW6jKSkJLKzs4mPb/5aRQr3nqD8sB/b3dm7+NTURDZyru1mqdvIeUhkfHneGTBwpGZ/CgBFRUWkpqaSl5eH6f8JnHPs27ePoqIihg8f3uzXKdzDqrIMNv4J1rwIn/4vnPot+KtfdOxnVldFNnLevAS2/MUPVQTIyIURF/hAz5sB/YYrzKVBZWVlCvYoZsaAAQPYs2dPi16ncA+T6kr44s+w9kXY8KqfVp88EE79G5hwTft/XlU5bP8oMpV/6zKoOOyPDTgZxsyKTOXPyGn/z5fQUrAfqzV/H20KdzO7C7gFcMAa4EZgCPAcMABYAVzvnKtoy+fIcdTU+HHfa16E9S/D0f2QmA5jZ8G4v4a8M9vvRmRF6Ven8lcF/aKDxvhfILVdLamD2+czRaRVWv2v3syygDuAMc65o2b2PHANcDHwS+fcc2b2OHAz8Kt2qa14zvlNldcuhnUv+ZuS8ckw6iIYdwWcfF77DBMsP+Rb47UrJm5bEbWR83gouCmY/TkN+g5o++eJdBGbNm3ikksuYe3atc06//HHHyc5OZlvf/vbjZ6zaNEiCgsLeeyxx75y7Gc/+xn33ntvq+vbkLY26eKAPmZWCSQDO4BzgeuC478B7kfh3nbOwe71PtDXLva7//ROgJPPh/F/DSMvbPt+nEcP+Kn8td0sO1aBqw42cj4Vpt3uu1lyT9dGziJRvvOd77Tp9V0q3J1z28xsPrAFOAr8Ed8NU+ycqwpOKwKy2lzLnmzf55FA3/OxXzPlxLP8zMzRfwV9Mlr/3of3RE3lX+JHtuD8L43s02Dm3
b5lnn0aJKa02yWJNNe//Pc61m8/2K7vOWZoGvd9Y2yT51VXV3PrrbeydOlSsrKyePnll9m+fTtz585lz549JCcn8+STTzJ69Gjuv/9+UlJS+N73vscHH3zAzTffTK9evTj//PN5/fXX674BbN++nQsvvJDPP/+cyy+/nIceeoh77rmHo0ePMnHiRMaOHcszzzzTLtfZlm6ZfsAsYDhQDLwAXNiC188B5gDk5ua2thrhVFIEa1/ygb5jpS/Lne5Hu5wyq/WrGR7cERnJsmkJ7P3El8f1gZwpcM69vr88q0AbOUuP99lnn/Hss8/y5JNPctVVV7F48WKeeuopHn/8cUaMGMGyZcu4/fbbefPNN4953Y033siTTz7JtGnTuOeee445tnLlSj766CMSExMZNWoU3/3ud5k3bx6PPfYYK1eubNf6t6Vb5mvAl865PQBm9hIwA8gws7ig9Z4NbGvoxc65J4AnAAoKClwb6hEOh/fA+t/7QN/yF1829FS44Kcw9nJIb+EXIOf8bM/oMD/wpT+WkBps5Hytb5kPmQhxCe17PSLtoDkt7I4yfPhwJk6cCMDkyZPZtGkTS5cu5corr6w7p7y8/JjXFBcXc+jQIaZNmwbAddddx6uvvlp3/LzzziM93Xdpjhkzhs2bN5OT0zEjydoS7luAqWaWjO+WOQ8oBN4CrsCPmJkNvNzWSobW0WL4+FU/0uXLP/vVDTNHwzn/DOO+6bd1ay7nfBdO9CJbB4v8sT79fMt/yq2+ZX7CeE3lF2lCYmJkUELv3r3ZtWsXGRkZbWph13/Pqqqq45zdNm3pc19mZi8CHwJVwEf4lvhrwHNm9kBQtqA9KhoaFUfgk9d9C33jG34jiX55cMbdfujiCWOa9z41Nb4PPnoq/+Fd/ljfzGCy0J0+zDNP0UbOIm2UlpbG8OHDeeGFF7jyyitxzrF69WomTJhQd05GRgapqaksW7aM008/neeee65Z7x0fH09lZWWLlhdoSpuab865+4D76hV/AUxpy/uGTlW5D/I1L8Kn/+OXAkgdClPm+Bb60ElNz9asqYada6IW2Vrqx7QDpGXB8LMiU/kHnKzZnyId4JlnnuG2227jgQceoLKykmuuueaYcAdYsGABt956K7169eKss86q64Y5njlz5pCfn8+kSZPa7YaqORf77u6CggJXWFgY62q0r+oq39WydrGfLVpeAn36w9jL/Fj03GnHb01XV8L2lVFT+d+H8mDUQL+8yMzPvBmQMUxhLqGxYcMGTjnllFhXo9UOHz5MSoofXTZv3jx27NjBo48+2ub3bejvxcxWOOcKGjpfHa/tqaYGtr4fmS1auhcS02D0JX4s+vCzoHcjX7sqy/wkodpFtrYuj9rIeaTvsqmd/dnSm6si0mlee+01fv7zn1NVVcWwYcNYtGhRTOqhcG8r5/z6KmsX++GLh7b7oYWjLgxmi36t4WGFFUd8gNd2sxQVQnU5fiPnsXDq9T7Ih83QRs4i3cjVV1/N1VdfHetqKNxbbfeGyOSi/V/4PTxP/hpc8BM/W7T+pJ+yg5GNnDcv8b8Qaqr8VP4hE4KRLMFGzsn9Y3NNIhIaCveW2P9FZHLR7vU+mIefCWfcBad8ww85rFW6349Xr11ka+fqqI2cJ8H07/p+85wpkJQWu2sSkVBSuDfl4HZY9zvfj779Q1+WMxUuetjfHK3dAOPwbn9e7SJbu9f58rqNnP8xMpU/ITk21yIiPYbCvSFH9gazRV/yQY3zXSfn/xjGftOvTV5SBF+8HVlka99n/rXxff3CWuMu9y3zrEnayFlEOp3CvVZZCXz8mm+hf/G2Xw1x4Eg4+wd+pEqv3r575a2fBRs5b/avS0yHYdNg0vU+zIfkNz4iRkRC7ZZbbuHuu+9mzJjGJyPecMMNXHLJJVxxxRXHlNcub3Ddddc18sqW6dnhXlHqJxWtXQyf/dHPFs3I9f3hQ/L98gBb/gK/+YYfBQOQPCDYyPk2381ywlht5
CwiAPz6179u9Ws3bdrEb3/7W4V7q1WVw+dv+hb6J69D5RHoOwhyTvebNlcdhY/+E5bs9eennBDZ93PYDBg4SlP5RTrL6/f4mdntafB4uGjecU95+OGHSUxM5I477uCuu+5i1apVvPnmm7z55pssWLCA2bNnc99991FeXs5JJ53EU089RUpKCmeffTbz58+noKCABQsW8OCDD5KRkcGECRNITEys26jjnXfe4ZFHHmHnzp089NBDXHHFFdxzzz1s2LCBiRMnMnv2bO666642XWbPCPfqKtj0brC36H9HNm0Gv0JiVbk/DpCe64c01oZ5/xM1+1Okh5k5cya/+MUvuOOOOygsLKS8vJzKykreffdd8vPzeeCBB3jjjTfo27cvDz74II888gg/+tGP6l6/fft2fvKTn/Dhhx+SmprKueeee8wyBTt27OC9997j448/5tJLL+WKK65g3rx5zJ8//5hVJNsivOFeUwNFy4PZor+HI43sHJ4yKLImy7DpvltGRLqGJlrYHWXy5MmsWLGCgwcPkpiYyKRJkygsLOTdd9/l0ksvZf369cyYMQOAioqKuiV+ay1fvpyzzjqL/v39nJUrr7ySTz/9tO74ZZddRq9evRgzZgy7du3qkGsIV7g757eGW/sirP1dZMnbaJmjI90sudMhbUjn11NEurT4+HiGDx/OokWLmD59Ovn5+bz11lts3LiR4cOHc/755/Pss8+2+v2jl/7tqPW9whHuez7xLfS1i2H/51EHzPev1YX5NOg7MGbVFJHuY+bMmcyfP5+FCxcyfvx47r77biZPnszUqVOZO3cuGzdu5OSTT+bIkSNs27aNkSNH1r32tNNO48477+TAgQOkpqayePFixo8ff9zPS01N5dChQ+1W/+4d7gd3wG+v8rM/we8vmjU5CPMz/E3StuwxKiI91syZM/npT3/KtGnT6Nu3L0lJScycOZPMzEwWLVrEtddeW7cT0wMPPHBMuGdlZXHvvfcyZcoU+vfvz+jRo5tc+jc/P5/evXszYcIEbrjhhjbfUO3eS/4e3A6v/5Mfjz5sug9zbeQs0q119yV/a9Uu/VtVVcXll1/OTTfdxOWXX97q9+tZS/6mDYWr/yPWtRAR+Yr777+fN954gzH7heoAAAgFSURBVLKyMi644AIuu+yyTv387h3uIiJd1Pz582P6+ZqNIyJdTlfoLu5KWvP3oXAXkS4lKSmJffv2KeADzjn27dtHUlIDm/4ch7plRKRLyc7OpqioiD17Gpl42AMlJSWRnZ3dotco3EWkS6mdQCRto24ZEZEQUriLiISQwl1EJIS6xAxVM9sDbI51PVppILA31pWIkZ567T31uqHnXntXve5hzrnMhg50iXDvzsyssLHpv2HXU6+9p1439Nxr747XrW4ZEZEQUriLiISQwr3tnoh1BWKop157T71u6LnX3u2uW33uIiIhpJa7iEgIKdxFREJI4d4CZpZjZm+Z2XozW2dmf1/v+D+YmTOzUG3UerzrNrPvmtnHQflDsaxnR2js2s1sopm9b2YrzazQzKbEuq7tycySzGy5ma0KrvtfgvLhZrbMzDaa2X+ZWUKs69rejnPtz5jZJ2a21swWmll8rOt6XM45/TTzBxgCTAoepwKfAmOC5znA/+InYw2MdV0747qBc4A3gMTg2KBY17UTr/2PwEVB+cXA27GuaztftwEpweN4YBkwFXgeuCYofxy4LdZ17cRrvzg4ZsCzXf3a1XJvAefcDufch8HjQ8AGICs4/Evg+0Do7lAf57pvA+Y558qDY7tjV8uOcZxrd0BacFo6sD02NewYzjscPI0PfhxwLvBiUP4boHP3jusEjV27c+4PwTEHLAdatgZvJ1O4t5KZ5QGnAsvMbBawzTm3KqaV6gTR1w2MBGYGX9P/bGanxbJuHa3etd8JPGxmW4H5wA9iV7OOYWa9zWwlsBv4E/A5UOycqwpOKSLSuAmV+tfunFsWdSweuB74n1jVrzkU7
q1gZinAYvw/8CrgXuBHMa1UJ4i+bufcQfx+AP3xX1n/EXjezCyGVewwDVz7bcBdzrkc4C5gQSzr1xGcc9XOuYn4FuoUYHSMq9Rp6l+7mY2LOvzvwDvOuXdjU7vmUbi3UPBbezHwjHPuJeAkYDiwysw24f9n+NDMBseulu2vgesG33J7KfimuhyowS+wFCqNXPtsoPbxC/jwCyXnXDHwFjANyDCz2k1+soFtMatYJ4i69gsBzOw+IBO4O5b1ag6FewsErdIFwAbn3CMAzrk1zrlBzrk851wePvAmOed2xrCq7aqh6w78Hn9TFTMbCSTQNVfOa7XjXPt24Kzg8bnAZ51dt45kZplmlhE87gOcj7/f8BZwRXDabODl2NSw4zRy7R+b2S3A14FrnXM1saxjc2iGaguY2RnAu8AafCsV4F7n3B+iztkEFDjnQhNyjV03fqTMQmAiUAF8zzn3Zkwq2UGOc+0HgUfxXVNlwO3OuRUxqWQHMLN8/A3T3vhG4PPOuR+b2YnAc/juuI+Av6m9oR4Wx7n2KvxouEPBqS85534co2o2SeEuIhJC6pYREQkhhbuISAgp3EVEQkjhLiISQgp3EZEQUriLiISQwl06nZnlmdnaVr72bDN7tb3r1JHMrMDM/rWFr7nfzL7XUXWS8Itr+hQRaQvnXCFQGOt6SM+ilrvESlyw+cEGM3vRzJLN7Dwz+8jM1gSbISQCmNmFwYYgHwLfDMp6mdlnZpYZ9Xxj7fP6zGyRmf0q2GDji+AbwMLg8xdFnferYPONuk0agvJ5wYYdq81sflB2ZbBxwyoze6exC43+thG0yBea2dtBPe6IOu+HZvapmb0HjIoqP8nM/sfMVpjZu2Y22szizOwDMzs7OOfnZvbTlv9nkNCK9YLy+ul5P0Aefm3wGcHzhcA/A1uBkUHZ0/hVN5OC8hH4TRKeB14NzrkPv0ojwAXA4uN85iL8tHkDZuGXDxiPb+CsACYG5/UP/uwNvA3kAwOAT4jM6M4I/lwDZEWXNfLZZ0fV+X5gKZCIX2RtH3698MnB+yXj14nfiF/OAeD/gBHB49OBN4PHY/HrvXwNvxRAQqz/2+qn6/yo5S6xstU5tyR4/J/AecCXzrlPg7LfAGfil5n90jn3mXPOBefWWgh8O3h8E/BUE5/538F7rAF2Ob/oWw2wDv8LB+Cq4BvCR/jwHAOU4NePWWBm3wRKg3OXAIvM7Fb8L4Pmes05V+78+kO7gROAmcDvnHOlzi8p/ArULTU8HXghWF/8/+F3h8I5tw74D+BV4CbnXEUL6iAhp3CXWKm/qFFxi9/Aua3ALjM7F7/k7utNvKR2gauaqMe1z+PMbDjwPeA851w+8BqQ5PzmFFPwOxBdQrBJg3PuO/hvHDnACjMb0MyqR392Nce/99ULv0HGxKifU6KOj8f/3Q1q5mdLD6Fwl1jJNbNpwePr8Dcc88zs5KDseuDPwMdB+UlB+bX13ufX+Nb8C8656jbWKQ04ApSY2QnARVDXek53fvXPu4AJQflJzrllzrkfAXvwId9a7wCXmVkfM0sFvgEQtOK/NLMrg880M6v9/G/iV2c8E/i32mVqRUDhLrHzCTDXzDYA/fB70N6I736oXV73cedcGTAHeC3oLqm/T+srQApNd8k0yfltEj/C/0L5Lb7bBfzG2K+a2WrgPSIbNTwc3Pxdi+9Hb/U2i87v0/pfwXu8DnwQdfhbwM1mtgrfhTTLzAYC84Bbgq6sx/BLEIsAWvJXujkzKwB+6ZybGeu6iHQlGucu3ZaZ3YPfy/Rbsa6LSFejlruEipn9ELiyXvELzrkOHwNuZl8HHqxX/KVz7vKO/myR+hTuIiIhpBuqIiIhpHAXEQkhhbuISAgp3EVEQuj/A6y/m6qL6lQYAAAAAElFTkSuQmCC\n" }, "metadata": { "needs_background": "light" } } ], "source": [ "people.plot(kind = 
\"line\", x = \"body_mass_index\", y = [\"height\", \"weight\"])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "l0DjpOZJbdkj" }, "source": [ "Again, there are way too many options to list here: the best option is to scroll through the [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code." ] }, { "cell_type": "markdown", "metadata": { "id": "6m0s18-Hbdkj" }, "source": [ "#### Operations on `DataFrame`s\n", "Although `DataFrame`s do not try to mimic NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "id": "-D1z8SV2bdkj", "outputId": "ed63ff4a-091a-4273-f8ec-0a5590d5f5e1", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sep oct nov\n", "alice 8 8 9\n", "bob 10 9 9\n", "charles 4 8 2\n", "darwin 9 10 10" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 113 } ], "source": [ "grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])\n", "grades = pd.DataFrame(grades_array, columns=[\"sep\", \"oct\", \"nov\"], index=[\"alice\",\"bob\",\"charles\",\"darwin\"])\n", "grades" ] }, { "cell_type": "markdown", "metadata": { "id": "YNYV5RC2bdkj" }, "source": [ "You can apply NumPy mathematical functions on a `DataFrame`: the function is applied to all values:" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "id": "tVS8LYkqbdkj", "outputId": "38d63ab6-a283-4e63-f5e0-fc75db2711fa", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sep oct nov\n", "alice 2.828427 2.828427 3.000000\n", "bob 3.162278 3.000000 3.000000\n", "charles 2.000000 2.828427 1.414214\n", "darwin 3.000000 3.162278 3.162278" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 114 } ], "source": [ "np.sqrt(grades)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "id": "H9ZQAzESbdkj", "outputId": "fcfd6dba-2682-4801-c339-c0392be70cd2", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sep oct nov\n", "alice 9 9 10\n", "bob 11 10 10\n", "charles 5 9 3\n", "darwin 10 11 11" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 115 } ], "source": [ "grades + 1" ] }, { "cell_type": "markdown", "metadata": { "id": "tUIPt4C-bdkk" }, "source": [ "Aggregation operations, such as computing the `max`, the `sum` or the `mean` of a `DataFrame`, apply to each column, and you get back a `Series` object:" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "id": "1C8lhBIXbdkk", "outputId": "d05f160a-9996-4ffa-883e-0658a5fa34d7", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "sep 7.75\n", "oct 8.75\n", "nov 7.50\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 116 } ], "source": [ "grades.mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "qWRNPxVGbdkl" }, "source": [ "Most of these functions take an optional `axis` parameter which lets you specify along which axis of the `DataFrame` you want the operation executed. The default is `axis=0`, meaning that the operation is executed vertically (on each column). You can set `axis=1` to execute the operation horizontally (on each row). For example, let's find out which students had all grades greater than `5`:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "id": "HoyAlWC6bdkl", "outputId": "afee0091-1564-40c4-aff7-3e43d167ac8a", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "alice True\n", "bob True\n", "charles False\n", "darwin True\n", "dtype: bool" ] }, "metadata": {}, "execution_count": 117 } ], "source": [ "(grades > 5).all(axis = 1)" ] }, { "cell_type": "markdown", "metadata": { "id": "mdIwFwAgbdkl" }, "source": [ "If you add a `Series` object to a `DataFrame` (or execute any other binary operation), pandas attempts to broadcast the operation to all *rows* in the `DataFrame`. This only works if the `Series` has the same size as the `DataFrame`'s rows. 
For example, let's subtract the `mean` of the `DataFrame` (a `Series` object) from the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "id": "sUzEkkg1bdkl", "outputId": "2c56256b-9c5e-4306-e47b-03b1457c099c", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sep oct nov\n", "alice 0.25 -0.75 1.5\n", "bob 2.25 0.25 1.5\n", "charles -3.75 -0.75 -5.5\n", "darwin 1.25 1.25 2.5" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 118 } ], "source": [ "grades - grades.mean() # equivalent to: grades - [7.75, 8.75, 7.50]" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "id": "xOU7mlmVbdkm", "outputId": "e6207be5-2db1-48fe-9b22-f0fb34aaca76", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sep oct nov\n", "alice 7.75 8.75 7.5\n", "bob 7.75 8.75 7.5\n", "charles 7.75 8.75 7.5\n", "darwin 7.75 8.75 7.5" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 119 } ], "source": [ "# We subtracted `7.75` from all September grades, `8.75` from October grades and `7.50` \n", "# from November grades. It is equivalent to subtracting this `DataFrame`:\n", "pd.DataFrame([[7.75, 8.75, 7.50]]*4, index=grades.index, columns=grades.columns)" ] }, { "cell_type": "markdown", "metadata": { "id": "cc1Lz1SSbdkm" }, "source": [ "If you want to subtract the global mean from every grade, here is one way to do it:" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "scrolled": true, "id": "nCJUsodMbdkm", "outputId": "18fdedb1-d8e4-409b-8edb-ae749e8be8e3", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sep oct nov\n", "alice 0.0 0.0 1.0\n", "bob 2.0 1.0 1.0\n", "charles -4.0 0.0 -6.0\n", "darwin 1.0 2.0 2.0" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 120 } ], "source": [ "grades - grades.values.mean() # subtracts the global mean (8.00) from all grades" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "id": "Qeykr6KIbdkm", "outputId": "0861bc4b-7f51-41c6-87fd-5a60b4acdc8a", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " oct nov dec\n", "bob 0.0 NaN 2.0\n", "colin NaN 1.0 0.0\n", "darwin 0.0 1.0 0.0\n", "charles 3.0 3.0 0.0" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 121 } ], "source": [ "bonus_array = np.array([[0,np.nan,2],[np.nan,1,0],[0, 1, 0], [3, 3, 0]])\n", "bonus_points = pd.DataFrame(bonus_array, columns=[\"oct\", \"nov\", \"dec\"], index=[\"bob\",\"colin\", \"darwin\", \"charles\"])\n", "bonus_points" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "scrolled": true, "id": "PmBlxZ2bbdkm", "outputId": "5c6c880c-eb65-41cd-e1ab-66a41604fc3d", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " dec nov oct sep\n", "alice NaN NaN NaN NaN\n", "bob NaN NaN 9.0 NaN\n", "charles NaN 5.0 11.0 NaN\n", "colin NaN NaN NaN NaN\n", "darwin NaN 11.0 10.0 NaN" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 122 } ], "source": [ "grades + bonus_points" ] }, { "cell_type": "markdown", "metadata": { "id": "ZEtUuQNHbdkm" }, "source": [ "#### Handling missing data\n", "Dealing with missing data is a frequent task when working with real-life data. Pandas offers a few tools to handle missing data.\n", " \n", "Let's try to fix the problem above, where adding `grades` and `bonus_points` produced `NaN` wherever a value was missing from either `DataFrame`. For example, we can decide that missing data should result in a zero instead of `NaN`. We can replace all `NaN` values with any value using the `fillna()` method:" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "scrolled": true, "id": "89vMid5vbdkm", "outputId": "4f8a9ba1-42e1-43f9-a1da-e118325c7ea4", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " dec nov oct sep\n", "alice 0.0 0.0 0.0 0.0\n", "bob 0.0 0.0 9.0 0.0\n", "charles 0.0 5.0 11.0 0.0\n", "colin 0.0 0.0 0.0 0.0\n", "darwin 0.0 11.0 10.0 0.0" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 123 } ], "source": [ "(grades + bonus_points).fillna(0)" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "scrolled": true, "id": "Nb9szHYrbdko", "outputId": "a8d1cfde-0587-49a0-def0-222278a6caac", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " dec nov oct sep\n", "alice NaN NaN NaN NaN\n", "bob NaN NaN 9.0 NaN\n", "charles NaN 5.0 11.0 NaN\n", "colin NaN NaN NaN NaN\n", "darwin NaN 11.0 10.0 NaN" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 124 } ], "source": [ "final_grades = grades + bonus_points\n", "final_grades" ] }, { "cell_type": "markdown", "metadata": { "id": "84Ov3x5Ubdko" }, "source": [ "We can call the `dropna()` method to get rid of rows that are full of `NaN`s:" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "id": "ACmmOPyrbdko", "outputId": "6ee6fbfc-d4a4-4ef4-de42-2e131edeca8d", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " dec nov oct sep\n", "bob NaN NaN 9.0 NaN\n", "charles NaN 5.0 11.0 NaN\n", "darwin NaN 11.0 10.0 NaN" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 125 } ], "source": [ "final_grades_clean = final_grades.dropna(how=\"all\")\n", "final_grades_clean" ] }, { "cell_type": "markdown", "metadata": { "id": "F_ft7zzcbdko" }, "source": [ "Now let's remove columns that are full of `NaN`s by setting the `axis` argument to `1`:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "id": "DeqnrLEnbdko", "outputId": "d957894d-f4d0-4dcb-c788-48ccf66a5394", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " nov oct\n", "bob NaN 9.0\n", "charles 5.0 11.0\n", "darwin 11.0 10.0" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 126 } ], "source": [ "final_grades_clean = final_grades_clean.dropna(axis=1, how=\"all\")\n", "final_grades_clean" ] }, { "cell_type": "markdown", "metadata": { "id": "NbyYrlrzbdko" }, "source": [ "#### Aggregating with `groupby`\n", "Similar to the SQL language, pandas allows grouping your data into groups to run calculations over each group.\n", "\n", "First, let's add some extra data about each person so we can group them, and let's go back to the `final_grades` `DataFrame` so we can see how `NaN` values are handled:" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "scrolled": true, "id": "II2IdemTbdkp", "outputId": "c78cf8f1-a24c-416a-a4b7-5a7188b2c34d", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " dec nov oct sep hobby\n", "alice NaN NaN NaN NaN Biking\n", "bob NaN NaN 9.0 NaN Dancing\n", "charles NaN 5.0 11.0 NaN NaN\n", "colin NaN NaN NaN NaN Dancing\n", "darwin NaN 11.0 10.0 NaN Biking" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 127 } ], "source": [ "final_grades[\"hobby\"] = [\"Biking\", \"Dancing\", np.nan, \"Dancing\", \"Biking\"]\n", "final_grades" ] }, { "cell_type": "markdown", "metadata": { "id": "2suQBYhGbdkp" }, "source": [ "Now let's group data in this `DataFrame` by hobby:" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "id": "dOhALCgkbdkp", "outputId": "a9170462-2a5a-4d52-a2e5-1c0b873e5195", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 128 } ], "source": [ "grouped_grades = final_grades.groupby(\"hobby\")\n", "grouped_grades" ] }, { "cell_type": "markdown", "metadata": { "id": "DZAFvTlTbdkp" }, "source": [ "We are ready to compute the average grade per hobby:" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "id": "o5Gs6lcLbdkp", "outputId": "992a9833-c604-45be-8893-c6cfbf97e791", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " dec nov oct sep\n", "hobby \n", "Biking NaN 11.0 10.0 NaN\n", "Dancing NaN NaN 9.0 NaN" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 129 } ], "source": [ "grouped_grades.mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "pgOZUDdRbdkp" }, "source": [ "That was easy! Note that the `NaN` values have simply been skipped when computing the means." ] }, { "cell_type": "markdown", "metadata": { "id": "PB0GOgYLbdkp" }, "source": [ "#### Pivot tables\n", "Pandas supports spreadsheet-like [pivot tables](https://en.wikipedia.org/wiki/Pivot_table) that allow quick data summarization." ] }, { "cell_type": "markdown", "metadata": { "id": "rK5gDiX6bdkr" }, "source": [ "#### Overview functions\n", "When dealing with large `DataFrame`s, it is useful to get a quick overview of their content. Pandas offers a few functions for this. First, let's create a large `DataFrame` with a mix of numeric values, missing values and text values. Notice how Jupyter displays only the corners of the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "id": "6u57h9MQbdkr", "outputId": "f86cea34-090e-4e85-c6c7-e02d3f1f07ad", "colab": { "base_uri": "https://localhost:8080/", "height": 424 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C some_text D E F G H I \\\n", "0 NaN 11.0 44.0 Blabla 99.0 NaN 88.0 22.0 165.0 143.0 \n", "1 11.0 22.0 55.0 Blabla 110.0 NaN 99.0 33.0 NaN 154.0 \n", "2 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 \n", "3 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN \n", "4 44.0 55.0 88.0 Blabla 143.0 33.0 132.0 66.0 22.0 NaN \n", "... ... ... ... ... ... ... ... ... ... ... \n", "9995 NaN NaN 33.0 Blabla 88.0 165.0 77.0 11.0 154.0 132.0 \n", "9996 NaN 11.0 44.0 Blabla 99.0 NaN 88.0 22.0 165.0 143.0 \n", "9997 11.0 22.0 55.0 Blabla 110.0 NaN 99.0 33.0 NaN 154.0 \n", "9998 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 \n", "9999 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN \n", "\n", " ... Q R S T U V W X Y Z \n", "0 ... 
11.0 NaN 11.0 44.0 99.0 NaN 88.0 22.0 165.0 143.0 \n", "1 ... 22.0 11.0 22.0 55.0 110.0 NaN 99.0 33.0 NaN 154.0 \n", "2 ... 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0 \n", "3 ... 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN \n", "4 ... 55.0 44.0 55.0 88.0 143.0 33.0 132.0 66.0 22.0 NaN \n", "... ... ... ... ... ... ... ... ... ... ... ... \n", "9995 ... NaN NaN NaN 33.0 88.0 165.0 77.0 11.0 154.0 132.0 \n", "9996 ... 11.0 NaN 11.0 44.0 99.0 NaN 88.0 22.0 165.0 143.0 \n", "9997 ... 22.0 11.0 22.0 55.0 110.0 NaN 99.0 33.0 NaN 154.0 \n", "9998 ... 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0 \n", "9999 ... 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN \n", "\n", "[10000 rows x 27 columns]" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 130 } ], "source": [ "much_data = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))\n", "large_df = pd.DataFrame(much_data, columns=list(\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"))\n", "large_df[large_df % 16 == 0] = np.nan\n", "large_df.insert(3,\"some_text\", \"Blabla\")\n", "large_df" ] }, { "cell_type": "markdown", "metadata": { "id": "qwsVdaFNbdkr" }, "source": [ "The `head()` method returns the top 5 rows:" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "id": "ZD-qCazAbdkr", "outputId": "8b563e33-1328-48a0-80eb-0baf3373c098", "colab": { "base_uri": "https://localhost:8080/", "height": 236 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C some_text D E F G H I ... \\\n", "0 NaN 11.0 44.0 Blabla 99.0 NaN 88.0 22.0 165.0 143.0 ... \n", "1 11.0 22.0 55.0 Blabla 110.0 NaN 99.0 33.0 NaN 154.0 ... \n", "2 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 ... \n", "3 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN ... \n", "4 44.0 55.0 88.0 Blabla 143.0 33.0 132.0 66.0 22.0 NaN ... \n", "\n", " Q R S T U V W X Y Z \n", "0 11.0 NaN 11.0 44.0 99.0 NaN 88.0 22.0 165.0 143.0 \n", "1 22.0 11.0 22.0 55.0 110.0 NaN 99.0 33.0 NaN 154.0 \n", "2 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0 \n", "3 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN \n", "4 55.0 44.0 55.0 88.0 143.0 33.0 132.0 66.0 22.0 NaN \n", "\n", "[5 rows x 27 columns]" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 131 } ], "source": [ "large_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "An6ZPfSvbdkr" }, "source": [ "Of course there's also a `tail()` function to view the bottom 5 rows. You can pass the number of rows you want:" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "id": "a4I2ghvbbdkr", "outputId": "e02cb78f-9f24-42c5-ff15-8eb7056f690f", "colab": { "base_uri": "https://localhost:8080/", "height": 141 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C some_text D E F G H I ... \\\n", "9998 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 ... \n", "9999 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN ... \n", "\n", " Q R S T U V W X Y Z \n", "9998 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0 \n", "9999 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN \n", "\n", "[2 rows x 27 columns]" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 132 } ], "source": [ "large_df.tail(n=2)" ] }, { "cell_type": "markdown", "metadata": { "id": "WtzFculCbdkr" }, "source": [ "The `info()` method prints out a summary of each column's contents:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "id": "m0kK-Undbdkr", "outputId": "d2b9328d-0542-4815-c628-019e01dfc7e3", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 10000 entries, 0 to 9999\n", "Data columns (total 27 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 A 8823 non-null float64\n", " 1 B 8824 non-null float64\n", " 2 C 8824 non-null float64\n", " 3 some_text 10000 non-null object \n", " 4 D 8824 non-null float64\n", " 5 E 8822 non-null float64\n", " 6 F 8824 non-null float64\n", " 7 G 8824 non-null float64\n", " 8 H 8822 non-null float64\n", " 9 I 8823 non-null float64\n", " 10 J 8823 non-null float64\n", " 11 K 8822 non-null float64\n", " 12 L 8824 non-null float64\n", " 13 M 8824 non-null float64\n", " 14 N 8822 non-null float64\n", " 15 O 8824 non-null float64\n", " 16 P 8824 non-null float64\n", " 17 Q 8824 non-null float64\n", " 18 R 8823 non-null float64\n", " 19 S 8824 non-null float64\n", " 20 T 8824 non-null float64\n", " 21 U 8824 non-null float64\n", " 22 V 8822 non-null float64\n", " 23 W 8824 non-null float64\n", " 24 X 8824 non-null float64\n", " 25 Y 8822 non-null float64\n", " 26 Z 8823 non-null float64\n", "dtypes: float64(26), object(1)\n", "memory usage: 2.1+ MB\n" ] } ], "source": [ "large_df.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "-93LVb6xbdks" }, "source": [ "Finally, the `describe()` method gives a nice overview of the main aggregated values over each column:\n", "* `count`: number of non-null (not NaN) values\n", "* `mean`: mean of non-null values\n", "* `std`: 
[standard 
deviation](https://en.wikipedia.org/wiki/Standard_deviation) of non-null values\n", "* `min`: minimum of non-null values\n", "* `25%`, `50%`, `75%`: 25th, 50th and 75th [percentile](https://en.wikipedia.org/wiki/Percentile) of non-null values\n", "* `max`: maximum of non-null values" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "id": "DqG1O-2Cbdks", "outputId": "5f62fedb-b288-46bc-ab3c-400708640e8f", "colab": { "base_uri": "https://localhost:8080/", "height": 394 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D E \\\n", "count 8823.000000 8824.000000 8824.000000 8824.000000 8822.000000 \n", "mean 87.977559 87.972575 87.987534 88.012466 87.983791 \n", "std 47.535911 47.535523 47.521679 47.521679 47.535001 \n", "min 11.000000 11.000000 11.000000 11.000000 11.000000 \n", "25% 44.000000 44.000000 44.000000 44.000000 44.000000 \n", "50% 88.000000 88.000000 88.000000 88.000000 88.000000 \n", "75% 132.000000 132.000000 132.000000 132.000000 132.000000 \n", "max 165.000000 165.000000 165.000000 165.000000 165.000000 \n", "\n", " F G H I J ... \\\n", "count 8824.000000 8824.000000 8822.000000 8823.000000 8823.000000 ... \n", "mean 88.007480 87.977561 88.000000 88.022441 88.022441 ... \n", "std 47.519371 47.529755 47.536879 47.535911 47.535911 ... \n", "min 11.000000 11.000000 11.000000 11.000000 11.000000 ... \n", "25% 44.000000 44.000000 44.000000 44.000000 44.000000 ... \n", "50% 88.000000 88.000000 88.000000 88.000000 88.000000 ... \n", "75% 132.000000 132.000000 132.000000 132.000000 132.000000 ... \n", "max 165.000000 165.000000 165.000000 165.000000 165.000000 ... 
\n", "\n", " Q R S T U \\\n", "count 8824.000000 8823.000000 8824.000000 8824.000000 8824.000000 \n", "mean 87.972575 87.977559 87.972575 87.987534 88.012466 \n", "std 47.535523 47.535911 47.535523 47.521679 47.521679 \n", "min 11.000000 11.000000 11.000000 11.000000 11.000000 \n", "25% 44.000000 44.000000 44.000000 44.000000 44.000000 \n", "50% 88.000000 88.000000 88.000000 88.000000 88.000000 \n", "75% 132.000000 132.000000 132.000000 132.000000 132.000000 \n", "max 165.000000 165.000000 165.000000 165.000000 165.000000 \n", "\n", " V W X Y Z \n", "count 8822.000000 8824.000000 8824.000000 8822.000000 8823.000000 \n", "mean 87.983791 88.007480 87.977561 88.000000 88.022441 \n", "std 47.535001 47.519371 47.529755 47.536879 47.535911 \n", "min 11.000000 11.000000 11.000000 11.000000 11.000000 \n", "25% 44.000000 44.000000 44.000000 44.000000 44.000000 \n", "50% 88.000000 88.000000 88.000000 88.000000 88.000000 \n", "75% 132.000000 132.000000 132.000000 132.000000 132.000000 \n", "max 165.000000 165.000000 165.000000 165.000000 165.000000 \n", "\n", "[8 rows x 26 columns]" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDEFGHIJ...QRSTUVWXYZ
count8823.0000008824.0000008824.0000008824.0000008822.0000008824.0000008824.0000008822.0000008823.0000008823.000000...8824.0000008823.0000008824.0000008824.0000008824.0000008822.0000008824.0000008824.0000008822.0000008823.000000
mean87.97755987.97257587.98753488.01246687.98379188.00748087.97756188.00000088.02244188.022441...87.97257587.97755987.97257587.98753488.01246687.98379188.00748087.97756188.00000088.022441
std47.53591147.53552347.52167947.52167947.53500147.51937147.52975547.53687947.53591147.535911...47.53552347.53591147.53552347.52167947.52167947.53500147.51937147.52975547.53687947.535911
min11.00000011.00000011.00000011.00000011.00000011.00000011.00000011.00000011.00000011.000000...11.00000011.00000011.00000011.00000011.00000011.00000011.00000011.00000011.00000011.000000
25%44.00000044.00000044.00000044.00000044.00000044.00000044.00000044.00000044.00000044.000000...44.00000044.00000044.00000044.00000044.00000044.00000044.00000044.00000044.00000044.000000
50%88.00000088.00000088.00000088.00000088.00000088.00000088.00000088.00000088.00000088.000000...88.00000088.00000088.00000088.00000088.00000088.00000088.00000088.00000088.00000088.000000
75%132.000000132.000000132.000000132.000000132.000000132.000000132.000000132.000000132.000000132.000000...132.000000132.000000132.000000132.000000132.000000132.000000132.000000132.000000132.000000132.000000
max165.000000165.000000165.000000165.000000165.000000165.000000165.000000165.000000165.000000165.000000...165.000000165.000000165.000000165.000000165.000000165.000000165.000000165.000000165.000000165.000000
\n", "

8 rows × 26 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 134 } ], "source": [ "large_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "id": "-LghiXSTbdks" }, "source": [ "#### Saving & loading\n", "Pandas can save `DataFrame`s to various backends, including file formats such as CSV, Excel, JSON, HTML and HDF5, or to a SQL database. Let's create a `DataFrame` to demonstrate this:" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "id": "F6dGE_DDbdks", "outputId": "9dcd8f4a-ce2c-4bed-90f1-4be820965e74", "colab": { "base_uri": "https://localhost:8080/", "height": 112 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " hobby weight birthyear children\n", "alice Biking 68.5 1985 NaN\n", "bob Dancing 83.1 1984 3.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hobbyweightbirthyearchildren
aliceBiking68.51985NaN
bobDancing83.119843.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 135 } ], "source": [ "my_df = pd.DataFrame(\n", " [[\"Biking\", 68.5, 1985, np.nan], [\"Dancing\", 83.1, 1984, 3]], \n", " columns=[\"hobby\",\"weight\",\"birthyear\",\"children\"],\n", " index=[\"alice\", \"bob\"]\n", ")\n", "my_df" ] }, { "cell_type": "markdown", "metadata": { "id": "cjas67GYbdks" }, "source": [ "#### Saving\n", "Let's save it to CSV, HTML and JSON:" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "id": "JUW20lWIbdku" }, "outputs": [], "source": [ "my_df.to_csv(\"my_df.csv\")\n", "my_df.to_html(\"my_df.html\")\n", "my_df.to_json(\"my_df.json\")" ] }, { "cell_type": "markdown", "metadata": { "id": "wkwqS47Obdku" }, "source": [ "Done! Let's take a peek at what was saved:" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "id": "XiXTvwh6bdku", "outputId": "ee3dd42c-d24d-4659-9abb-9979d4a83650", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "# my_df.csv\n", ",hobby,weight,birthyear,children\n", "alice,Biking,68.5,1985,\n", "bob,Dancing,83.1,1984,3.0\n", "\n", "\n", "# my_df.html\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hobbyweightbirthyearchildren
aliceBiking68.51985NaN
bobDancing83.119843.0
\n", "\n", "# my_df.json\n", "{\"hobby\":{\"alice\":\"Biking\",\"bob\":\"Dancing\"},\"weight\":{\"alice\":68.5,\"bob\":83.1},\"birthyear\":{\"alice\":1985,\"bob\":1984},\"children\":{\"alice\":null,\"bob\":3.0}}\n", "\n" ] } ], "source": [ "for filename in (\"my_df.csv\", \"my_df.html\", \"my_df.json\"):\n", "    print(\"#\", filename)\n", "    with open(filename, \"rt\") as f:\n", "        print(f.read())\n", "    print()\n" ] }, { "cell_type": "markdown", "metadata": { "id": "zX9DuLcJbdkv" }, "source": [ "Note that the index is saved as the first column (with no name) in a CSV file, as `<th>` tags in HTML and as keys in JSON.\n", "\n", "Saving to other formats works very similarly, but some formats require extra libraries to be installed. For example, saving to Excel requires the openpyxl library:" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "id": "DUfZubs3bdkv" }, "outputs": [], "source": [ "try:\n", "    my_df.to_excel(\"my_df.xlsx\", sheet_name='People')\n", "except ImportError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": { "id": "A92UjaI8bdkv" }, "source": [ "#### Loading\n", "Now let's load our CSV file back into a `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "id": "6HKE0X1Ybdkv", "outputId": "ae0a5a30-f563-41f9-d4c8-5aa0e379de02", "colab": { "base_uri": "https://localhost:8080/", "height": 112 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " hobby weight birthyear children\n", "alice Biking 68.5 1985 NaN\n", "bob Dancing 83.1 1984 3.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hobbyweightbirthyearchildren
aliceBiking68.51985NaN
bobDancing83.119843.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 139 } ], "source": [ "my_df_loaded = pd.read_csv(\"my_df.csv\", index_col=0)\n", "my_df_loaded" ] }, { "cell_type": "markdown", "metadata": { "id": "-cs24XxSbdkv" }, "source": [ "As you might guess, there are similar `read_json`, `read_html` and `read_excel` functions as well. We can also read data straight from the Internet. For example, let's load the top 1,000 U.S. cities from GitHub:" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "id": "J6WRgAnObdkv", "outputId": "8e7cf6f5-52e7-4c33-c803-f24b0a60a648", "colab": { "base_uri": "https://localhost:8080/", "height": 238 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " State Population lat lon\n", "City \n", "Marysville Washington 63269 48.051764 -122.177082\n", "Perris California 72326 33.782519 -117.228648\n", "Cleveland Ohio 390113 41.499320 -81.694361\n", "Worcester Massachusetts 182544 42.262593 -71.802293\n", "Columbia South Carolina 133358 34.000710 -81.034814" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StatePopulationlatlon
City
MarysvilleWashington6326948.051764-122.177082
PerrisCalifornia7232633.782519-117.228648
ClevelandOhio39011341.499320-81.694361
WorcesterMassachusetts18254442.262593-71.802293
ColumbiaSouth Carolina13335834.000710-81.034814
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 140 } ], "source": [ "us_cities = None\n", "try:\n", "    csv_url = \"https://raw.githubusercontent.com/plotly/datasets/master/us-cities-top-1k.csv\"\n", "    us_cities = pd.read_csv(csv_url, index_col=0)\n", "    us_cities = us_cities.head()\n", "except IOError as e:\n", "    print(e)\n", "us_cities" ] }, { "cell_type": "markdown", "metadata": { "id": "XpWNQD_Jbdkv" }, "source": [ "There are more options available, in particular regarding datetime format. Check out the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for more details." ] }, { "cell_type": "markdown", "metadata": { "id": "yHIObss4bdkv" }, "source": [ "#### Combining `DataFrame`s\n", "\n", "One powerful feature of pandas is its ability to perform SQL-like joins on `DataFrame`s. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple of simple `DataFrame`s:" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "id": "TgNPwsexbdkw", "outputId": "a262863a-4c42-4be4-e9f7-b4bb1fcdd8b5", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state city lat lng\n", "0 CA San Francisco 37.781334 -122.416728\n", "1 NY New York 40.705649 -74.008344\n", "2 FL Miami 25.791100 -80.320733\n", "3 OH Cleveland 41.473508 -81.739791\n", "4 UT Salt Lake City 40.755851 -111.896657" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statecitylatlng
0CASan Francisco37.781334-122.416728
1NYNew York40.705649-74.008344
2FLMiami25.791100-80.320733
3OHCleveland41.473508-81.739791
4UTSalt Lake City40.755851-111.896657
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 141 } ], "source": [ "city_loc = pd.DataFrame(\n", " [\n", " [\"CA\", \"San Francisco\", 37.781334, -122.416728],\n", " [\"NY\", \"New York\", 40.705649, -74.008344],\n", " [\"FL\", \"Miami\", 25.791100, -80.320733],\n", " [\"OH\", \"Cleveland\", 41.473508, -81.739791],\n", " [\"UT\", \"Salt Lake City\", 40.755851, -111.896657]\n", " ], columns=[\"state\", \"city\", \"lat\", \"lng\"])\n", "city_loc" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "id": "-F2yDn3cbdkw", "outputId": "0b39866c-fd49-4124-8bf9-26995aab820e", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " population city state\n", "3 808976 San Francisco California\n", "4 8363710 New York New-York\n", "5 413201 Miami Florida\n", "6 2242193 Houston Texas" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationcitystate
3808976San FranciscoCalifornia
48363710New YorkNew-York
5413201MiamiFlorida
62242193HoustonTexas
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 142 } ], "source": [ "city_pop = pd.DataFrame(\n", " [\n", " [808976, \"San Francisco\", \"California\"],\n", " [8363710, \"New York\", \"New-York\"],\n", " [413201, \"Miami\", \"Florida\"],\n", " [2242193, \"Houston\", \"Texas\"]\n", " ], index=[3,4,5,6], columns=[\"population\", \"city\", \"state\"])\n", "city_pop" ] }, { "cell_type": "markdown", "metadata": { "id": "e767etKZbdkw" }, "source": [ "Now let's join these `DataFrame`s using the `merge()` function:" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "id": "Mdztg6KPbdkw", "outputId": "fa5d35a5-120b-47d2-c513-027739974d54", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state_x city lat lng population state_y\n", "0 CA San Francisco 37.781334 -122.416728 808976 California\n", "1 NY New York 40.705649 -74.008344 8363710 New-York\n", "2 FL Miami 25.791100 -80.320733 413201 Florida" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
state_xcitylatlngpopulationstate_y
0CASan Francisco37.781334-122.416728808976California
1NYNew York40.705649-74.0083448363710New-York
2FLMiami25.791100-80.320733413201Florida
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 143 } ], "source": [ "pd.merge(left=city_loc, right=city_pop, on=\"city\")" ] }, { "cell_type": "markdown", "metadata": { "id": "0aIeJRFNbdkw" }, "source": [ "Note that both `DataFrame`s have a column named `state`, so in the result they got renamed to `state_x` and `state_y`.\n", "\n", "Also, note that Cleveland, Salt Lake City and Houston were dropped because they don't exist in *both* `DataFrame`s. This is the equivalent of a SQL `INNER JOIN`. If you want a `FULL OUTER JOIN`, where no city gets dropped and `NaN` values are added, you must specify `how=\"outer\"`:" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "id": "5p98Bdybbdkw", "outputId": "ccdfddce-9760-464b-d623-08d63cecafd8", "colab": { "base_uri": "https://localhost:8080/", "height": 238 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state_x city lat lng population state_y\n", "0 CA San Francisco 37.781334 -122.416728 808976.0 California\n", "1 NY New York 40.705649 -74.008344 8363710.0 New-York\n", "2 FL Miami 25.791100 -80.320733 413201.0 Florida\n", "3 OH Cleveland 41.473508 -81.739791 NaN NaN\n", "4 UT Salt Lake City 40.755851 -111.896657 NaN NaN\n", "5 NaN Houston NaN NaN 2242193.0 Texas" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
state_xcitylatlngpopulationstate_y
0CASan Francisco37.781334-122.416728808976.0California
1NYNew York40.705649-74.0083448363710.0New-York
2FLMiami25.791100-80.320733413201.0Florida
3OHCleveland41.473508-81.739791NaNNaN
4UTSalt Lake City40.755851-111.896657NaNNaN
5NaNHoustonNaNNaN2242193.0Texas
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 144 } ], "source": [ "all_cities = pd.merge(left=city_loc, right=city_pop, on=\"city\", how=\"outer\")\n", "all_cities" ] }, { "cell_type": "markdown", "metadata": { "id": "9InNQCuSbdkw" }, "source": [ "Of course `LEFT OUTER JOIN` is also available by setting `how=\"left\"`: only the cities present in the left `DataFrame` end up in the result. Similarly, with `how=\"right\"` only cities in the right `DataFrame` appear in the result. For example:" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "id": "eJC2h_mAbdkw", "outputId": "3aed3310-57eb-4f63-dce9-072910b94a0a", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state_x city lat lng population state_y\n", "0 CA San Francisco 37.781334 -122.416728 808976 California\n", "1 NY New York 40.705649 -74.008344 8363710 New-York\n", "2 FL Miami 25.791100 -80.320733 413201 Florida\n", "3 NaN Houston NaN NaN 2242193 Texas" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
state_xcitylatlngpopulationstate_y
0CASan Francisco37.781334-122.416728808976California
1NYNew York40.705649-74.0083448363710New-York
2FLMiami25.791100-80.320733413201Florida
3NaNHoustonNaNNaN2242193Texas
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 145 } ], "source": [ "pd.merge(left=city_loc, right=city_pop, on=\"city\", how=\"right\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Wwp527vqbdkx" }, "source": [ "If the key to join on is actually in one (or both) `DataFrame`'s index, you must use `left_index=True` and/or `right_index=True`. If the key column names differ, you must use `left_on` and `right_on`. For example:" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "id": "t7TB757Ibdkx", "outputId": "100b43cb-0198-4028-a911-f71ec3d20697", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state_x city lat lng population name \\\n", "0 CA San Francisco 37.781334 -122.416728 808976 San Francisco \n", "1 NY New York 40.705649 -74.008344 8363710 New York \n", "2 FL Miami 25.791100 -80.320733 413201 Miami \n", "\n", " state_y \n", "0 California \n", "1 New-York \n", "2 Florida " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
state_xcitylatlngpopulationnamestate_y
0CASan Francisco37.781334-122.416728808976San FranciscoCalifornia
1NYNew York40.705649-74.0083448363710New YorkNew-York
2FLMiami25.791100-80.320733413201MiamiFlorida
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 146 } ], "source": [ "city_pop2 = city_pop.copy()\n", "city_pop2.columns = [\"population\", \"name\", \"state\"]\n", "pd.merge(left=city_loc, right=city_pop2, left_on=\"city\", right_on=\"name\")" ] }, { "cell_type": "markdown", "metadata": { "id": "GdkK0fRRbdkx" }, "source": [ "#### Concatenation\n", "Rather than joining `DataFrame`s, we may just want to concatenate them. That's what `concat()` is for:" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "id": "OB8vX0C7bdkx", "outputId": "a074f2b3-52f3-4491-cf15-c540c5ab310e", "colab": { "base_uri": "https://localhost:8080/", "height": 332 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state city lat lng population\n", "0 CA San Francisco 37.781334 -122.416728 NaN\n", "1 NY New York 40.705649 -74.008344 NaN\n", "2 FL Miami 25.791100 -80.320733 NaN\n", "3 OH Cleveland 41.473508 -81.739791 NaN\n", "4 UT Salt Lake City 40.755851 -111.896657 NaN\n", "3 California San Francisco NaN NaN 808976.0\n", "4 New-York New York NaN NaN 8363710.0\n", "5 Florida Miami NaN NaN 413201.0\n", "6 Texas Houston NaN NaN 2242193.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statecitylatlngpopulation
0CASan Francisco37.781334-122.416728NaN
1NYNew York40.705649-74.008344NaN
2FLMiami25.791100-80.320733NaN
3OHCleveland41.473508-81.739791NaN
4UTSalt Lake City40.755851-111.896657NaN
3CaliforniaSan FranciscoNaNNaN808976.0
4New-YorkNew YorkNaNNaN8363710.0
5FloridaMiamiNaNNaN413201.0
6TexasHoustonNaNNaN2242193.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 147 } ], "source": [ "result_concat = pd.concat([city_loc, city_pop])\n", "result_concat" ] }, { "cell_type": "markdown", "metadata": { "id": "mNaralK8bdkx" }, "source": [ "Note that this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows sharing the same index label (e.g. 3). Pandas handles this rather gracefully:" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "id": "VDFrmNMGbdkx", "outputId": "a8e2c6b4-9ccd-49eb-bca3-add0cdc858c8", "colab": { "base_uri": "https://localhost:8080/", "height": 112 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state city lat lng population\n", "3 OH Cleveland 41.473508 -81.739791 NaN\n", "3 California San Francisco NaN NaN 808976.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statecitylatlngpopulation
3OHCleveland41.473508-81.739791NaN
3CaliforniaSan FranciscoNaNNaN808976.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 148 } ], "source": [ "result_concat.loc[3]" ] }, { "cell_type": "markdown", "metadata": { "id": "oSp4G88tbdkx" }, "source": [ "Or you can tell pandas to just ignore the index:" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "id": "BrVK8LtDbdkx", "outputId": "e601d609-30e6-4bda-8e70-fbc317e74477", "colab": { "base_uri": "https://localhost:8080/", "height": 332 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state city lat lng population\n", "0 CA San Francisco 37.781334 -122.416728 NaN\n", "1 NY New York 40.705649 -74.008344 NaN\n", "2 FL Miami 25.791100 -80.320733 NaN\n", "3 OH Cleveland 41.473508 -81.739791 NaN\n", "4 UT Salt Lake City 40.755851 -111.896657 NaN\n", "5 California San Francisco NaN NaN 808976.0\n", "6 New-York New York NaN NaN 8363710.0\n", "7 Florida Miami NaN NaN 413201.0\n", "8 Texas Houston NaN NaN 2242193.0" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statecitylatlngpopulation
0CASan Francisco37.781334-122.416728NaN
1NYNew York40.705649-74.008344NaN
2FLMiami25.791100-80.320733NaN
3OHCleveland41.473508-81.739791NaN
4UTSalt Lake City40.755851-111.896657NaN
5CaliforniaSan FranciscoNaNNaN808976.0
6New-YorkNew YorkNaNNaN8363710.0
7FloridaMiamiNaNNaN413201.0
8TexasHoustonNaNNaN2242193.0
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 149 } ], "source": [ "pd.concat([city_loc, city_pop], ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "rHrwIwkobdkx" }, "source": [ "Notice that when a column does not exist in a `DataFrame`, it acts as if it was filled with `NaN` values. If we set `join=\"inner\"`, then only columns that exist in *both* `DataFrame`s are returned:" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "id": "FlkwidKubdky", "outputId": "1e33194a-f709-4fe7-fde7-ba3282f6ef94", "colab": { "base_uri": "https://localhost:8080/", "height": 332 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " state city\n", "0 CA San Francisco\n", "1 NY New York\n", "2 FL Miami\n", "3 OH Cleveland\n", "4 UT Salt Lake City\n", "3 California San Francisco\n", "4 New-York New York\n", "5 Florida Miami\n", "6 Texas Houston" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
statecity
0CASan Francisco
1NYNew York
2FLMiami
3OHCleveland
4UTSalt Lake City
3CaliforniaSan Francisco
4New-YorkNew York
5FloridaMiami
6TexasHouston
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 150 } ], "source": [ "pd.concat([city_loc, city_pop], join=\"inner\")" ] }, { "cell_type": "markdown", "metadata": { "id": "fMXy67Thbdkz" }, "source": [ "#### Categories\n", "It is quite frequent to have values that represent categories, for example `1` for female and `2` for male, or `\"A\"` for Good, `\"B\"` for Average, `\"C\"` for Bad. These categorical values can be hard to read and cumbersome to handle, but fortunately pandas makes it easy. To illustrate this, let's take the `city_pop` `DataFrame` we created earlier, and add a column that represents a category:" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "id": "ca9MbHkJbdkz", "outputId": "0f9909f0-c306-4e30-a606-a02af7f72f5f", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " population city state eco_code\n", "3 808976 San Francisco California 17\n", "4 8363710 New York New-York 17\n", "5 413201 Miami Florida 34\n", "6 2242193 Houston Texas 20" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationcitystateeco_code
3808976San FranciscoCalifornia17
48363710New YorkNew-York17
5413201MiamiFlorida34
62242193HoustonTexas20
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 151 } ], "source": [ "city_eco = city_pop.copy()\n", "city_eco[\"eco_code\"] = [17, 17, 34, 20]\n", "city_eco" ] }, { "cell_type": "markdown", "metadata": { "id": "Z8kMBnYCbdkz" }, "source": [ "Right now the `eco_code` column is full of apparently meaningless codes. Let's fix that. First, we will create a new categorical column based on the `eco_code`s:" ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "id": "3Vvut-u9bdkz", "outputId": "f731bdc0-5498-48bf-dc40-78345578e38c", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Int64Index([17, 20, 34], dtype='int64')" ] }, "metadata": {}, "execution_count": 152 } ], "source": [ "city_eco[\"economy\"] = city_eco[\"eco_code\"].astype('category')\n", "city_eco[\"economy\"].cat.categories" ] }, { "cell_type": "markdown", "metadata": { "id": "8NwGaKhebdkz" }, "source": [ "Now we can give each category a meaningful name:" ] }, { "cell_type": "code", "execution_count": 153, "metadata": { "id": "NEJS59vnbdkz", "outputId": "ef85fa21-7e8b-49ed-ceb8-56a8982e8e75", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " population city state eco_code economy\n", "3 808976 San Francisco California 17 Finance\n", "4 8363710 New York New-York 17 Finance\n", "5 413201 Miami Florida 34 Tourism\n", "6 2242193 Houston Texas 20 Energy" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationcitystateeco_codeeconomy
3808976San FranciscoCalifornia17Finance
48363710New YorkNew-York17Finance
5413201MiamiFlorida34Tourism
62242193HoustonTexas20Energy
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 153 } ], "source": [ "city_eco[\"economy\"].cat.categories = [\"Finance\", \"Energy\", \"Tourism\"]\n", "city_eco" ] }, { "cell_type": "markdown", "metadata": { "id": "b5l_n08-bdkz" }, "source": [ "Note that categorical values are sorted according to their categorical order, *not* their alphabetical order:" ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "id": "r5e0jca6bdkz", "outputId": "7470c030-bcc8-475b-e59b-0d2348f38741", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " population city state eco_code economy\n", "5 413201 Miami Florida 34 Tourism\n", "6 2242193 Houston Texas 20 Energy\n", "3 808976 San Francisco California 17 Finance\n", "4 8363710 New York New-York 17 Finance" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
populationcitystateeco_codeeconomy
5413201MiamiFlorida34Tourism
62242193HoustonTexas20Energy
3808976San FranciscoCalifornia17Finance
48363710New YorkNew-York17Finance
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 154 } ], "source": [ "city_eco.sort_values(by=\"economy\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "-c0f9fnEbdk0" }, "source": [ "## What next?\n", "As you probably noticed by now, pandas is quite a large library with *many* features. Although we went through the most important features, there is still a lot to discover. Probably the best way to learn more is to get your hands dirty with some real-life data. It is also a good idea to go through pandas' excellent [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html), in particular the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).\n", "\n", "You can also work with Bigquery in Panda. Check out https://googleapis.dev/python/bigquery/latest/usage/pandas.html and https://pandas-gbq.readthedocs.io/en/latest/ for more details." ] }, { "cell_type": "code", "source": [ "" ], "metadata": { "id": "MB6TYLpobQzG" }, "execution_count": null, "outputs": [] } ] }